DEV Community

Olivia Perell

Why I Locked My Assistant to One Model for a Week - and What Broke


I still remember the afternoon of March 18, 2025: I was sprinting to ship a developer assistant for our team (project: dev-ai-helper v0.9.1) when the assistant started returning hallucinated TODOs in pull request summaries. It wasn't subtle - the change coincided with a model swap I had performed that morning to test latency improvements, and I had detailed logs, a failing unit test, and an exact error trace to prove it. That day taught me a lesson about model architecture, tooling, and the ergonomics of switching models mid-flight - and set the stage for the experiment that followed.

The experiment setup that led to a meltdown

I had a minimal pipeline: a lightweight frontend, a Node.js proxy that handled auth and prompt shaping, and a small inference worker. For the week-long experiment I wanted to reduce inference cost without sacrificing quality. I replaced the default generation model with

Claude 3.5 Haiku

in the middle of a release candidate and left the rest of the stack unchanged, which produced an obvious trade-off: the new model was faster but produced terse summaries that sometimes omitted required code references. The symptoms showed up as failing integration tests and annoyed reviewers.

Two things mattered: reproducible failure logs and a clean before/after comparison. Here was the first failing error we logged, verbatim:

Error (integration-test): "Summary omitted required API call: createSession()" at line 42 - output token mismatch.

I also captured the before/after micro-benchmarks. Before (baseline model): median latency 780ms, 99th percentile 1.2s. After switching: median latency 260ms, 99th percentile 420ms. Accuracy on the PR-summary test suite fell from 92% to 73%.
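The percentile numbers came from a simple timing harness. A minimal sketch looks like this (the `infer` stub here is a placeholder so the script runs standalone, not our actual inference worker):

```python
# bench.py - collect median and 99th-percentile latency for an inference call.
import statistics
import time

def bench(infer, prompts):
    """Time each call; return (median_ms, p99_ms)."""
    samples = []
    for p in prompts:
        start = time.perf_counter()
        infer(p)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p99 = samples[min(len(samples) - 1, int(len(samples) * 0.99))]
    return statistics.median(samples), p99

# Stub model so the script runs standalone; swap in a real inference call.
median_ms, p99_ms = bench(lambda p: time.sleep(0.001), ["prompt"] * 50)
print(f"median={median_ms:.2f}ms p99={p99_ms:.2f}ms")
```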


Why the transformer choice mattered for this workload

What I was doing is common: ask a model to compress a diff, extract key function calls, and generate an action list. That relies heavily on long-range attention and consistent token-level reasoning. The transformer's attention mechanism is the core here - a small, tuned model with a shorter effective context or a different decoding strategy will be faster but can lose the subtle links that make a generated checklist precise.

I iterated through a few alternatives. One swap moved us to

Claude Opus 4.1

for a generation pass and used the smaller model for drafts, but juggling two models increased engineering complexity and produced inconsistent outputs. In practice the right balance was not obvious: aggressive batching and temperature tuning helped, but only to a point.

For context, this snippet shows the single-file inference call I used to validate token probabilities (this is the actual curl I ran against our test harness):

# sanity-check: request token logprobs for the failing prompt
curl -s -X POST "https://api.local/infer" \
  -H "Authorization: Bearer $DEV_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-sonnet-37","logprobs":true,"prompt":"<diff> ... summarize changes"}'

That command replaced a previous quick check that used no logprobs and therefore hid the probability collapse that signaled missing references.
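To make that collapse visible in practice, I used a check along these lines. This is a hypothetical sketch: it assumes the harness returns parallel lists of generated tokens and their logprobs, and the floor value is illustrative:

```python
# check_logprobs.py - flag generated tokens whose probability collapsed.
# Hypothetical sketch: assumes /infer returns parallel token and logprob lists.

def collapsed_tokens(tokens, logprobs, floor=-6.0):
    """Return (token, logprob) pairs generated with logprob below the floor."""
    return [(t, lp) for t, lp in zip(tokens, logprobs) if lp < floor]

def mentions_required_call(tokens, required="createSession"):
    """Cheap check that a required API identifier survived generation."""
    return required in "".join(tokens)

# Example values standing in for a real /infer response.
tokens = ["Summary", ":", " update", " session", " handling"]
logprobs = [-0.2, -0.1, -1.3, -7.5, -0.4]
print(collapsed_tokens(tokens, logprobs))  # [(' session', -7.5)]
print(mentions_required_call(tokens))      # False: createSession was dropped
```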


Three code artifacts that mattered (and why I included them)

I relied on three concrete artifacts during troubleshooting: the prompt template, the model config, and the local replay harness. Each was real, and I iterated on them.

Prompt template (what I replaced):

{
  "role":"system",
  "content":"You are a concise code summarizer. Extract function calls and TODOs."
}

I replaced it with a richer instruction that enforced extraction rules and example outputs; that change improved precision by about 8%.
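For flavor, here is a sketch of the kind of contract that worked - the wording is illustrative, not the exact production template:

```json
{
  "role": "system",
  "content": "You are a concise code summarizer. For each diff: (1) list every function call added or removed, verbatim, e.g. createSession(); (2) list TODOs with file and line; (3) never omit an API call that appears in the diff. Example output:\n- Calls: createSession(), logEvent()\n- TODOs: src/auth.js:42 - validate token expiry"
}
```

The key difference from the original was making omission an explicit rule violation rather than hoping "extract function calls" implied completeness.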

Local replay harness (how I reproduced failures):

# replay.py - replays stored prompts to test model consistency
import requests, json
PROMPT = open("sample_prompt.txt").read()
r = requests.post("http://localhost:8080/infer", json={"prompt": PROMPT, "model":"claude-sonnet-37"})
print(r.json())

This harness let me confirm that switching back to a more capable model restored the missing API calls.
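On top of the harness, a small consistency check made regressions obvious: extract the function calls each model's summary mentions and diff the sets. The `extract_calls` helper below is a simplified stand-in for what we actually parsed:

```python
# consistency.py - compare which API calls survive in each model's summary.
import re

def extract_calls(summary: str) -> set:
    """Pull identifiers that look like zero-arg calls, e.g. createSession()."""
    return set(re.findall(r"\b(\w+)\(\)", summary))

baseline = "Adds session setup; calls createSession() and logEvent()."
candidate = "Adds session setup and logging."

# The calls the cheaper model dropped from its summary.
missing = extract_calls(baseline) - extract_calls(candidate)
print(sorted(missing))
```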


The trade-offs I documented (and where this approach fails)

I refused to present any single approach as universally correct. The trade-offs were explicit:

  • Cost vs. fidelity: cheaper models reduced per-token cost by ~60% but introduced omission errors in structured tasks.
  • Latency vs. context: faster models trimmed latency but sacrificed multi-step reasoning across diffs.
  • Complexity vs. consistency: using dual models (draft + final pass) improved quality but doubled the state to manage and increased maintenance costs.

I documented one scenario where the cheaper route would not work: security-sensitive code reviews where missing a single authorization check is unacceptable. For that use case, the faster but shallower model is a non-starter.


Gluing multiple models together - an architecture decision

I ultimately chose a "draft + verify" pipeline: produce a short draft with a lightweight generator, then run a verification pass with a stronger model and simple retrieval-augmented checks. That decision favored maintainability and allowed us to control costs: run the expensive verifier only on items that fail heuristics.
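The routing logic fits in a few lines. This is a minimal sketch of the pattern, not our production code - the model names, the `summarize` stub, and the required-call heuristic are all illustrative:

```python
# draft_verify.py - sketch of the "draft + verify" routing pattern.
# Hypothetical names: the models and summarize() stub are illustrative.

REQUIRED_CALLS = {"createSession"}  # identifiers a summary must mention

def summarize(model, diff):
    """Stand-in for a real inference call (e.g. POST to /infer)."""
    if model == "small-draft":
        return "Updated session handling."  # terse draft, may omit calls
    return "Updated session handling; calls createSession()."

def passes_heuristics(summary):
    """Cheap check: every required API call must appear in the summary."""
    return all(call in summary for call in REQUIRED_CALLS)

def draft_then_verify(diff):
    draft = summarize("small-draft", diff)
    if passes_heuristics(draft):
        return draft  # cheap path: most items stop here
    return summarize("full-verifier", diff)  # expensive pass only on failures

result = draft_then_verify("diff --git a/session.py ...")
print(result)
```

The design choice worth noting: the heuristic gate runs before the expensive model, so verifier cost scales with the failure rate of the draft, not with total traffic.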

To avoid the engineering overhead, I used a platform that lets me switch models in the same chat session, run side-by-side views, and keep chat histories tied to requests - that workflow removed a lot of friction when toggling between generator variants and running structured tests. It felt like the right ergonomic move for teams that value reproducibility and quick iteration.


Evidence and before/after comparisons you can reproduce

Concrete numbers from our 48-hour A/B run:

  • Baseline (single full-size model): accuracy 92%, median latency 780ms, cost per 1k tokens $0.20.
  • Draft+verify (small draft + verifier): accuracy 90%, median latency 410ms, cost per 1k tokens $0.08.

The minor drop in accuracy came with a 60% cost reduction and a 47% median latency improvement, which was an acceptable trade-off for non-critical summaries. For tasks requiring absolute precision we stuck with the full-size generator.

A practical option I tested mid-experiment was to route difficult prompts to

Claude 3.7 Sonnet

selectively, and that cut down on unnecessary verifier runs while preserving correctness on edge cases.
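The "difficult prompt" test itself was just a heuristic. A minimal sketch - the thresholds and keyword list here are illustrative assumptions, not our production values:

```python
# route.py - heuristic for routing "difficult" prompts to the stronger model.
# The line threshold and keyword list are illustrative, not production values.

SENSITIVE = ("auth", "token", "permission", "session")

def is_difficult(diff: str, max_lines: int = 200) -> bool:
    """Route large diffs or security-adjacent changes to the stronger model."""
    large = diff.count("\n") > max_lines
    sensitive = any(k in diff.lower() for k in SENSITIVE)
    return large or sensitive

print(is_difficult("refactor createSession() auth check"))  # True
print(is_difficult("fix typo in README"))                   # False
```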


Final notes for teams who build developer-facing assistants

If you build tools that must be reliable under code review pressure, invest time in: (1) reproducible harnesses, (2) clear prompt contracts, and (3) a model workflow that supports quick switching without re-architecting your stack. On that last point, platforms that give you easy model selection, side-by-side comparison, and persistent chat histories make experimentation far less painful. They let you test permutations like swapping in a "flash-lite" inference variant when low latency matters, for example by routing some of the workload to a service that implements efficient MoE inference and resource-aware scaling, which I experimented with as a final optimization.

In one mid-week run I also tried an experimental lightweight pass with a sparse MoE model running on edge hardware to validate extreme latency targets, and that was useful for narrow bursts but not for general-purpose summarization.

Before I sign off, let me mention that if your team needs a single place to iterate - from model selection and prompt shaping to image generation and code artifacts - look for a platform that bundles these controls together and keeps your experiments reproducible. That's what made the difference for us: fewer context switches, auditable runs, and a sane way to move between "fast" and "precise" modes without rewriting the whole infra. In practice, linking a stable, model-agnostic workflow to your CI was the change that stopped hallucinations from leaking into production.

