After working on production AI systems across fintech, healthcare, and SaaS, I've seen this pattern repeat so consistently that it now has a name in our team: the week-6 demo gap.
The AI demo worked perfectly. Six weeks after launch, users started reporting wrong outputs. Nobody could explain why, because the system was never built to explain why.
Here's what causes it, and the 5 architecture decisions that prevent it.
The Demo Is Not the Product
Every AI demo uses carefully selected examples where the system performs well. Production users are unpredictable — they hit exactly the edge cases the demo never surfaced.
This isn't dishonesty on the part of the development team. It's the natural result of showcasing a system under optimal conditions rather than operating it under production conditions.
The gap:
- Demo inputs: curated, cleaned, representative of the "easy 80%"
- Production inputs: unpredictable, messy, often the "hard 20%" that breaks the system
The 5 Architecture Decisions That Determine Outcome
1. Eval Framework — Built Before Application Code
```python
# Minimal eval framework structure
eval_config = {
    "test_set_path": "./eval/production_samples_500.jsonl",
    "metrics": ["precision_at_5", "factual_accuracy", "format_compliance"],
    "regression_threshold": {
        "precision_at_5": 0.05,   # max allowed drop before blocking
        "factual_accuracy": 0.03,
        "format_compliance": 0.02,
    },
    "human_eval_sample_rate": 0.02,  # 2% of production calls sampled
}
```
Without this: you make a change, it looks good on 10 examples, you ship it. Two weeks later: users report a regression that wasn't in your 10 examples. With this: every change is validated against 500 representative labelled examples before shipping.
Build this in week 1. Not week 10.
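The regression check itself is simple once the eval set exists. A minimal sketch of the ship/block gate implied by `regression_threshold` above (the `check_regression` helper and the metric values are illustrative, not from a real pipeline):

```python
def check_regression(baseline: dict, candidate: dict, thresholds: dict) -> list:
    """Return the metrics whose drop from baseline exceeds the allowed threshold."""
    failures = []
    for metric, max_drop in thresholds.items():
        drop = baseline[metric] - candidate[metric]
        if drop > max_drop:
            failures.append((metric, round(drop, 4)))
    return failures

# Illustrative numbers: a candidate change that regresses precision
baseline = {"precision_at_5": 0.82, "factual_accuracy": 0.91, "format_compliance": 0.99}
candidate = {"precision_at_5": 0.74, "factual_accuracy": 0.90, "format_compliance": 0.99}
thresholds = {"precision_at_5": 0.05, "factual_accuracy": 0.03, "format_compliance": 0.02}

failures = check_regression(baseline, candidate, thresholds)
if failures:
    print(f"BLOCKED: {failures}")  # precision_at_5 dropped 0.08, over the 0.05 limit
```

Wire this into CI so a failing gate blocks the deploy the same way a failing unit test does.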
2. Confidence Thresholding — Route Low-Confidence Outputs
```python
def route_ai_response(response, context):
    confidence = calculate_confidence(response, context)
    if confidence >= 0.85:
        return {"status": "serve", "response": response}
    elif confidence >= 0.70:
        return {
            "status": "serve_with_caveat",
            "response": response,
            "disclaimer": "Note: This response may have lower accuracy on this specific query.",
        }
    else:
        return {
            "status": "fallback",
            "response": "I don't have confident information on this specific question.",
            "escalate": True,
        }
```
The system should know when it doesn't know. This is not an optional quality-of-life feature. In regulated industries (fintech, healthcare), a system that presents guesses as facts with equal confidence is a compliance risk.
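The routing code above leaves `calculate_confidence` undefined, and there is no single right implementation. One common approach is to blend generation-side and retrieval-side signals; the sketch below assumes you have the mean token log-probability from the model and a top retrieval similarity score, and the weights are illustrative, not tuned values:

```python
import math

def calculate_confidence(avg_token_logprob: float, top_retrieval_score: float) -> float:
    """Blend two common confidence signals into a single 0-1 score.

    avg_token_logprob: mean log-probability of generated tokens (<= 0).
    top_retrieval_score: best retrieval similarity, already in [0, 1].
    Weights are illustrative placeholders, not tuned values.
    """
    generation_signal = math.exp(avg_token_logprob)  # maps (-inf, 0] into (0, 1]
    return 0.6 * generation_signal + 0.4 * top_retrieval_score

# Confident generation with strong retrieval support scores high
print(round(calculate_confidence(-0.05, 0.9), 3))  # ~0.931
```

Whatever signals you use, calibrate the 0.85/0.70 thresholds against your eval set rather than picking them by feel.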
3. Graceful Degradation — Design Every Failure Path
```python
async def ai_feature_handler(user_query: str, context: dict) -> AIResponse:
    try:
        response = await ai_client.generate(user_query, context, timeout=4.0)
        if not validate_response_format(response):
            return fallback_response("format_invalid")
        if get_confidence(response) < CONFIDENCE_THRESHOLD:
            return fallback_response("low_confidence")
        return AIResponse(content=response, status="success")
    except TimeoutError:
        return fallback_response("timeout")
    except RateLimitError:
        return fallback_response("rate_limited")
    except Exception as e:
        log_ai_error(e, user_query, context)
        return fallback_response("unexpected_error")

def fallback_response(reason: str) -> AIResponse:
    """Always return something useful, never break silently."""
    messages = {
        "timeout": "Taking longer than expected. Please try again in a moment.",
        "low_confidence": "I don't have enough information to answer this confidently.",
        "rate_limited": "High demand right now. Please retry in 30 seconds.",
        "format_invalid": "Unable to process response. Please rephrase your question.",
        "unexpected_error": "Something went wrong. Our team has been notified.",
    }
    return AIResponse(content=messages.get(reason, "An error occurred."), status="fallback", reason=reason)
```
The failure path needs as much design attention as the success path. In most systems we audit, failure handling is an afterthought.
4. Retrieval Quality Monitoring — Separate from Generation Quality
```python
from statistics import mean

class RetrievalMonitor:
    def log_retrieval_event(self, query, retrieved_chunks, latency_ms):
        # Track separately from generation quality
        scores = [c.score for c in retrieved_chunks]
        top_score = max(scores) if scores else 0.0  # guard: retrieval can return nothing
        self.metrics.record({
            "query_hash": hash(query),
            "retrieval_latency_ms": latency_ms,
            "chunks_returned": len(retrieved_chunks),
            "avg_relevance_score": mean(scores) if scores else 0.0,
            "top_relevance_score": top_score,
            "low_confidence_retrieval": top_score < 0.7,
        })

    def alert_on_degradation(self, window_minutes=60):
        recent_events = self.get_events_in_window(window_minutes)
        if not recent_events:
            return  # quiet window: nothing to evaluate, avoid division by zero
        low_confidence_rate = sum(
            1 for e in recent_events if e["low_confidence_retrieval"]
        ) / len(recent_events)
        if low_confidence_rate > 0.15:  # >15% of queries returning low-confidence retrievals
            self.send_alert(
                f"Retrieval quality degraded: {low_confidence_rate:.1%} "
                f"low-confidence rate in last {window_minutes}min"
            )
```
Retrieval and generation fail independently. A system can have good generation quality on easy queries and silently terrible retrieval on hard queries. End-to-end metrics don't surface this. You need separate retrieval monitoring.
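The alert math reduces to a simple windowed rate. A standalone sketch of that core check, with in-memory events standing in for whatever metrics store the monitor actually writes to (the helper name and sample data are illustrative):

```python
def low_confidence_rate(events: list) -> float:
    """Fraction of retrieval events flagged low-confidence; 0.0 for an empty window."""
    if not events:
        return 0.0  # avoid division by zero on quiet windows
    return sum(1 for e in events if e["low_confidence_retrieval"]) / len(events)

# 3 of 10 queries in the window hit low-confidence retrieval
events = [{"low_confidence_retrieval": flag} for flag in
          [False, False, True, False, True, False, False, False, False, True]]
rate = low_confidence_rate(events)
print(f"{rate:.1%}")  # 30.0% -- well over the 15% alert threshold
```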
5. Model Version Pinning — No Surprise Breaking Changes
```yaml
# ai_config.yaml
models:
  primary:
    provider: openai
    model: "gpt-4o-2024-08-06"  # Pinned — not "gpt-4o" (auto-updates)
    fallback: "gpt-4o-mini-2024-07-18"
  embedding:
    provider: openai
    model: "text-embedding-3-small"
    version: "2"
deployment:
  auto_upgrade: false
  change_management: required
  test_before_upgrade: true
```
"Latest" is not a production model version. Pin everything. Test model upgrades in staging with your eval suite before promoting to production.
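One lightweight way to enforce pinning is a CI check that rejects floating aliases. The sketch below assumes OpenAI-style date-suffixed snapshot names (`-YYYY-MM-DD`); the `find_unpinned` helper is hypothetical and the pattern would need adapting for providers with other naming schemes:

```python
import re

# A name is considered pinned only if it ends in an explicit date suffix.
PINNED = re.compile(r".*-\d{4}-\d{2}-\d{2}$")

def find_unpinned(model_names: list) -> list:
    """Return the model names that are floating aliases rather than date pins."""
    return [name for name in model_names if not PINNED.match(name)]

unpinned = find_unpinned(["gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18", "gpt-4o"])
print(unpinned)  # ['gpt-4o'] -- fail the build if this list is non-empty
```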
The One Question That Tests All 5
Before signing any AI development contract, ask the vendor:
"Can you run your demo on 20 inputs I select, including our messiest real-world examples?"
Teams who've built for production say yes immediately.
Teams who've built impressive demos find qualifications: "we'd need to clean it first," "that's a slightly different use case," "we'll address that in phase 2."
Those qualifications are the answer.
Summary
| Gap | Prevention |
|---|---|
| No eval framework | Build the eval suite in week 1, before application code |
| No confidence handling | Implement thresholding + routing |
| No graceful degradation | Design every failure path |
| No retrieval monitoring | Separate retrieval metrics from generation |
| Model version surprises | Pin all model versions |
What production AI gap has been hardest to catch before it affected users? Drop your experience in the comments; it's genuinely useful to compare patterns across different domains.
Sunil writes about production AI engineering from the Ailoitte team, where we build 12-week AI Velocity Pods engagements for fintech, healthcare, and SaaS companies.