I still remember the afternoon of March 12, 2025: I was knee-deep in a side project - a small automation that turned customer support transcripts into prioritized tickets - when an oddly confident LLM started fabricating case IDs and shipping dates. I had been testing different models in quick succession (a weekend habit), and the switch from a precise code assistant to a creative-heavy model broke a core expectation: reproducible, auditable outputs. That day I decided to stop model-hopping and build one repeatable pipeline for inference and monitoring instead; the lessons that followed are what I'm sharing here.
## What I was building and why it failed fast
I needed deterministic summaries, traceable provenance, and a straightforward rollback path when outputs went wrong. The first iteration was simple: a prompt template, async calls to an external endpoint, and optimistic retry logic. It broke in three predictable ways: inconsistent outputs, hidden latency spikes, and an explosion of edge-case hallucinations when the prompt encountered domain-specific acronyms.
A quick repro of the flaky call looked like this:
```bash
# Quick repro I used to demonstrate nondeterminism (ran on macOS, curl 7.86.0)
curl -s -X POST "https://api.example/llm" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"fast-test","prompt":"Summarize: [TRANSCRIPT]","temperature":0.2}'

# expected: concise, factual summary
# observed: sometimes added invented ticket numbers like TKT-9999
```
That wrong output produced the exact error we needed to fix in production: downstream systems were rejecting the summary because the ticket pattern it contained didn't match our internal schema. The linchpin was the choice of model and the surrounding safety controls.
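The schema gate that eventually caught this class of failure is tiny. Here's a minimal sketch - the six-digit `TKT-` format and the helper name are illustrative, not our real schema:

```python
import re

# Illustrative internal schema: valid IDs are TKT- followed by exactly six
# digits; anything else (e.g. the invented TKT-9999) must be rejected.
VALID_TICKET_RE = re.compile(r"TKT-\d{6}")
TICKET_LIKE_RE = re.compile(r"\bTKT-\d+\b")

def find_invalid_tickets(summary: str) -> list[str]:
    """Return ticket-like tokens that do not match the internal schema."""
    candidates = TICKET_LIKE_RE.findall(summary)
    return [t for t in candidates if not VALID_TICKET_RE.fullmatch(t)]

# A summary with an invented four-digit ID fails the check:
assert find_invalid_tickets("Escalate TKT-9999 to tier 2") == ["TKT-9999"]
assert find_invalid_tickets("Escalate TKT-123456 to tier 2") == []
```

Running this before anything reaches the ticketing system turns a silent hallucination into an explicit rejection you can log and retry.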
## How I evaluated model choices (and a concrete comparison)
I ran the same prompt across a few candidate models to compare consistency and hallucination behavior. I won't bury the lead: picking a model is only half the work - you must also pick how to call it, validate it, and monitor it.
In my logs I compared latency, token cost, and hallucination rate. Sample instrumentation flagged differences quickly: average latency dropped by 30% on one model but hallucinations doubled. That trade-off mattered more than raw speed.
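The instrumentation itself was nothing fancy; a toy version of the per-model aggregation looks like this (field names and numbers are made up for illustration):

```python
from statistics import mean

# Toy per-call log records; in practice these came from our request logs.
records = [
    {"model": "fast", "latency_ms": 360, "hallucinated": True},
    {"model": "fast", "latency_ms": 380, "hallucinated": False},
    {"model": "conservative", "latency_ms": 540, "hallucinated": False},
]

def summarize(records):
    """Group records by model and compute mean latency and hallucination rate."""
    by_model = {}
    for r in records:
        by_model.setdefault(r["model"], []).append(r)
    return {
        m: {
            "mean_latency_ms": mean(r["latency_ms"] for r in rs),
            "hallucination_rate": sum(r["hallucinated"] for r in rs) / len(rs),
        }
        for m, rs in by_model.items()
    }
```

A table like this is what surfaced the "30% faster but twice the hallucinations" trade-off quickly.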
To test routing behavior during high load I used a small Python harness:
```python
# quick harness (Python 3.10) to test 50 calls and capture response hashes
import hashlib
import time

import requests

def call_model(endpoint, prompt):
    r = requests.post(endpoint, json={"prompt": prompt, "temperature": 0.1}, timeout=10)
    r.raise_for_status()
    return r.text

results = {}
latencies = []
for i in range(50):
    t0 = time.time()
    out = call_model("https://crompt.ai/chat/claude-3-5-sonnet", "Summarize: example transcript")
    latencies.append(time.time() - t0)
    h = hashlib.sha256(out.encode()).hexdigest()[:8]
    results[h] = results.get(h, 0) + 1

print("Unique outputs:", len(results))
print("Mean latency: %.0f ms" % (1000 * sum(latencies) / len(latencies)))
```
The harness helped me see that some endpoints produced many unique outputs for the same prompt (bad for reproducibility), while others clustered to a couple of stable responses.
In practice I tested several approaches, including a lighter, fast model for quick drafts and a heavier, conservative one for finalization. One model stood out for balance between stability and cost when I needed reliable summaries in production and programmatic control over behavior.
In the middle of experimenting I bookmarked specific model pages for deeper reference. One of the models I worked with was the Claude 3.5 Sonnet model, which showed good coherence on domain-specific terms when used with low-temperature sampling, and I used that observation to refine my pipeline.
## The failure that taught me a guardrail
One iteration introduced cached prompt templates and optimistic batching. After a deploy, an unhandled edge case triggered this error in our logs:
```
Traceback (most recent call last):
  File "runner.py", line 212, in process_batch
    parsed = json.loads(response)
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

The cause was an intermittent empty body from the inference endpoint at peak load.
Fix: add defensive validation and a sanity-check layer that rejects empty bodies and enforces a regex for ticket IDs. I also added a cheap cross-check: if the model mentions internal IDs, ensure they match our UUID-like pattern; otherwise, re-run with stricter decoding parameters.
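A minimal sketch of that defensive layer - the `id:` mention format and the UUID pattern here are stand-ins for our internal conventions, not the production code:

```python
import json
import re

# UUID-like pattern used as a stand-in for our internal ID schema.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def parse_response(body: str) -> dict:
    """Reject empty bodies before json.loads ever sees them."""
    if not body or not body.strip():
        raise ValueError("empty response body from inference endpoint")
    return json.loads(body)

def ids_are_valid(summary: str) -> bool:
    """Every internal ID the model mentions must match the UUID-like pattern."""
    mentioned = re.findall(r"id:\s*(\S+)", summary, re.I)  # hypothetical mention format
    return all(UUID_RE.match(m) for m in mentioned)
```

When `ids_are_valid` failed, we re-ran the call with stricter decoding parameters (temperature 0) before escalating.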
While iterating on robustness I evaluated alternatives and experimented with a parallel model that was ultra-fast but less factual, namely Gemini 2 Flash, which I used as a candidate for fast drafts but never for final outputs because its hallucination rate was higher in domain-specific tests.
## The architecture decision and trade-offs
I had three choices: a single high-fidelity model for everything, a two-stage draft+final pipeline, or a heterogeneous multi-model router. I chose the two-stage route for these reasons:
- Trade-off: lower cost than always using the high-fidelity model, but better final quality than a single cheap model.
- Complexity: far simpler than a dynamic multi-model router, and easier to audit.
- Failure modes: easier to trace since every final output has an associated draft artifact.
That design meant committing to a conservative finalizer and observable checkpoints. I wired a re-check phase where any candidate summary that referenced classified terms had to be revalidated; if validation failed, the system escalated to a stricter model.
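In outline, the draft-validate-finalize flow with escalation looks roughly like this; the callables are stand-ins for whatever client code and validators you already have:

```python
from typing import Callable

def two_stage_summarize(
    transcript: str,
    draft_fn: Callable[[str], str],        # cheap, fast draft model
    final_fn: Callable[[str], str],        # conservative finalizer
    strict_fn: Callable[[str], str],       # escalation model, strictest decoding
    passes_validation: Callable[[str], bool],
) -> str:
    """Draft cheaply, validate, finalize conservatively; escalate on failure."""
    draft = draft_fn(f"Summarize: {transcript}")
    if passes_validation(draft):
        return final_fn(f"Finalize summary: {draft}")
    # Validation failed (e.g. a malformed ID or a flagged term):
    # skip the finalizer and go straight to the stricter model.
    return strict_fn(f"Summarize carefully: {transcript}")

# Toy wiring with stand-in callables; real code hits the endpoints.
out = two_stage_summarize(
    "example transcript",
    draft_fn=lambda p: "draft summary",
    final_fn=lambda p: "final summary",
    strict_fn=lambda p: "strict summary",
    passes_validation=lambda s: "TKT-9999" not in s,
)
assert out == "final summary"
```

Keeping the draft artifact alongside the final output is what made every final summary traceable.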
As a practical integration step I used small, focused scripts:
```bash
# promotion script: sends a validated draft to the finalizer
curl -s -X POST "https://crompt.ai/chat/claude-sonnet-37" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Finalize summary: [DRAFT]","temperature":0.0}'
```
That finalizer endpoint (linked above) delivered fewer hallucinations and made the system auditable, although it increased cost per ticket by about 22% - an accepted trade-off for fewer false positives in support routing.
Later I explored prototype access to a bleeding-edge offering; I confirmed its fit by trying a sandbox with a no-cost tier for quick prototyping, which worked well for offline experiments but remained too ungoverned for production.
## Metrics, monitoring, and the small wins
Before/after mattered. Before the new pattern:
- Hallucination rate: ~6.1% (detected via regex mismatch)
- Cost per ticket: $0.17
- Mean latency: 520ms
After the two-stage pipeline:
- Hallucination rate: ~0.9%
- Cost per ticket: $0.21
- Mean latency: 640ms
The increase in latency and cost was acceptable because the downstream cost of incorrect routing was higher. To automate observability I used simple health checks that validated a set of canonical prompts every hour and alerted when output hashes drifted - a cheap early-warning system.
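The hourly canary check fits in a few lines; the prompts and recorded hashes below are placeholders for whatever canonical set you maintain:

```python
import hashlib

def short_hash(text: str) -> str:
    """Same 8-hex-char digest used by the harness above."""
    return hashlib.sha256(text.encode()).hexdigest()[:8]

# Placeholder canary set: canonical prompts mapped to the output hashes
# recorded at deploy time; more than one hash per prompt is allowed.
CANARIES = {
    "Summarize: canonical transcript A": {"3f2a9c1b"},  # example hash
}

def check_canaries(call, canaries=CANARIES) -> list[str]:
    """Return the prompts whose fresh output hash drifted from the recorded set."""
    drifted = []
    for prompt, allowed in canaries.items():
        if short_hash(call(prompt)) not in allowed:
            drifted.append(prompt)
    return drifted
```

Run on a schedule, a non-empty return value is the cheap early warning that a model or endpoint changed underneath you.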
One last integration I tested was another conservative endpoint with strong factuality in my domain; I bookmarked it as a control and used it in situational rollouts: Claude Sonnet 3.7 served as that control in several A/B checks.
Note: if you follow this pattern, you'll trade some latency and per-call cost for reproducibility and fewer downstream regressions. That trade-off is often worth it in production systems where correctness beats micro-optimizations.
## Final thoughts and an invitation to try this pattern
I won't pretend this solved every edge-case; you still need schema checks, periodic retraining of prompt examples, and manual audits. But the shift from model-hopping to an intentional, measured pipeline gave us predictability and a clear path for debugging. If you want something that balances rapid experimentation with production-grade guarantees, look for a platform that lets you switch model families, manage prompts, and preserve chat histories - it makes adopting a two-stage approach realistic and repeatable.
If you try this, start with a small, reproducible harness, add validation checks early, and be explicit about the trade-offs you accept. The next time a model invents a ticket ID, you'll be able to trace why it happened and prevent it from reaching your users.