If you've tried shipping an AI feature to production recently, you know the gap between "demo works in staging" and "prod-stable under real load" is enormous.
This post is about the architecture decisions that close that gap: specifically, the five engineering phases we've converged on after shipping production AI across 14+ industries. No fluff, just the decisions that matter.
The 4 Engineering Failure Modes That Kill AI Timelines
Before the framework, the failure modes. These are not theoretical; every one of them has caused a production incident or a blown timeline in the last 18 months.
1. Token cost explosions in agentic loops
Single-turn LLM calls are predictable. Agentic loops, where an AI takes sequential actions, calls tools, and iterates, are not. Without per-workflow token budgets, you're running an infinite loop on a metered connection.
Here's what unguarded agentic architecture looks like:
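A minimal, self-contained sketch of that shape (`call_llm` and `run_tool` are stubs standing in for a real model client and tool dispatcher, not any specific SDK):

```python
class Response:
    def __init__(self, text, is_final, tool_call=""):
        self.text = text
        self.is_final = is_final
        self.tool_call = tool_call

def call_llm(history):
    # Stub: in production, every one of these calls is metered per token.
    return Response(text="answer", is_final=len(history) > 3, tool_call="search")

def run_tool(tool_call):
    # Stub tool dispatcher.
    return f"result of {tool_call}"

def agent_loop(task):
    """Unguarded agentic loop: no iteration cap, no token budget."""
    history = [task]
    while True:                                       # nothing bounds this loop
        response = call_llm(history)                  # every turn is billed
        if response.is_final:
            return response.text
        history.append(run_tool(response.tool_call))  # may iterate indefinitely
```

Nothing in that loop knows how much it has spent or how many turns it has taken; termination depends entirely on the model deciding it is done.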

We diagnosed a production chatbot burning $400/day per enterprise client. Nobody noticed until month 3, by which point the feature was destroying margin in real time. The fix:
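One way to implement that guard, sketched with an illustrative interface (a `tokens_used` field on each response; the budget and turn-cap numbers are placeholders, not recommendations):

```python
class TokenBudgetExceeded(RuntimeError):
    pass

class BudgetedAgent:
    """Per-workflow token budget: the loop halts instead of burning money."""

    def __init__(self, max_tokens=50_000, max_turns=10):
        self.max_tokens = max_tokens
        self.max_turns = max_turns
        self.spent = 0

    def charge(self, tokens):
        # Enforce the budget on every turn, not after the fact.
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise TokenBudgetExceeded(
                f"workflow spent {self.spent} of {self.max_tokens} tokens")

    def run(self, task, call_llm, run_tool):
        history = [task]
        for _ in range(self.max_turns):        # hard iteration cap
            response = call_llm(history)
            self.charge(response.tokens_used)  # budget check per turn
            if response.is_final:
                return response.text
            history.append(run_tool(response.tool_call))
        raise TokenBudgetExceeded("max turns reached without a final answer")
```

Either limit tripping raises instead of silently spending, which turns a month-3 billing surprise into a day-1 alert.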
2. RAG without domain boundaries
The naive RAG setup: dump all your enterprise data into a vector store, let the LLM retrieve whatever it wants. This produces authoritative hallucinations: outputs that are coherent, confident, and wrong because they blend context from unrelated domains.
Domain-Driven Design applies directly to AI service layers. The principle: an AI workflow accesses only the data collections relevant to its task category. Full stop.
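A sketch of that principle, assuming a hypothetical `vector_store.search(collection, query)` interface; the workflow-to-collection map and all names are illustrative:

```python
# Each workflow is registered against the only collections it may read.
DOMAIN_COLLECTIONS = {
    "billing_support": ["invoices", "pricing_policies"],
    "hr_assistant":    ["employee_handbook", "benefits"],
}

def retrieve(workflow, query, vector_store):
    """Domain-scoped retrieval: a workflow can only see its own collections."""
    allowed = DOMAIN_COLLECTIONS.get(workflow)
    if allowed is None:
        raise PermissionError(f"no domains registered for workflow {workflow!r}")
    # Cross-domain context blending is impossible by construction:
    # the search never touches a collection outside the allow-list.
    return [hit for c in allowed for hit in vector_store.search(c, query)]
```

The allow-list doubles as the compliance artifact: auditing "what data could have informed this decision" is a dictionary lookup.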

The benefits compound: smaller context windows (lower cost), easier compliance auditing (you know exactly what data informed every decision), and a dramatically reduced hallucination surface area.
3. No observability in production
You are not done shipping when the feature passes staging tests. Production AI requires active monitoring that most teams treat as a post-launch concern. It isn't.
The minimum viable observability stack for production AI:
• Hallucination detection — compare outputs against retrieved source context; flag divergence above a threshold
• Drift detection — monitor output distribution over time; model behavior changes as training data ages
• HITL checkpoints — for high-stakes decisions (loan approvals, patient triage, compliance flags), human review before action
• Decision logs — structured record of: input, retrieved context, model output, confidence score, action taken. Forensic trail for every decision
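The decision-log item above can be sketched as a single structured record that also drives the HITL checkpoint; field names and the 0.7 review threshold are illustrative, and `sink` stands in for wherever your logs actually go (file, queue, exporter):

```python
import json
import time
import uuid

def log_decision(input_text, retrieved_context, model_output,
                 confidence, action, sink=print):
    """Emit one forensic record per AI decision: input, context, output,
    confidence, and the action taken."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "input": input_text,
        "retrieved_context": retrieved_context,
        "output": model_output,
        "confidence": confidence,
        "action": action,
        # Low-confidence decisions are flagged for human review before action.
        "needs_review": confidence < 0.7,
    }
    sink(json.dumps(record))
    return record
```

Because every record carries the retrieved context alongside the output, hallucination detection becomes an offline comparison over the same log rather than a separate pipeline.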

4. Single-provider lock-in
The LLM landscape shifts quarterly. Lock-in to a single provider is technical debt that compounds with every model release you can't migrate to.
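The standard mitigation is a thin provider abstraction, so application code never imports a vendor SDK directly. A sketch (all names are illustrative; real adapters would wrap the vendor clients behind the same method):

```python
from typing import Protocol

class LLMProvider(Protocol):
    """The seam between your application and any one vendor's SDK."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class EchoProvider:
    # Stand-in adapter; a real one would call OpenAI, Anthropic, or a
    # local model behind this identical signature.
    def complete(self, prompt: str, max_tokens: int) -> str:
        return prompt[:max_tokens]

def answer(question: str, provider: LLMProvider) -> str:
    # Application code depends only on the protocol, so swapping
    # providers is a config change, not a migration.
    return provider.complete(question, max_tokens=100)
```

When the next model release lands, you write one new adapter and benchmark it against the incumbent, instead of rewriting every call site.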
The 5-Phase Delivery Framework
The Billing Model Is an Architectural Decision
This sounds like a business detail. It isn't. The billing model determines every engineering incentive in the engagement.
Under hourly billing: no structural reason to ship faster, optimize token costs, or build durable monitoring. Every inefficiency is revenue. Every extra sprint is billable.
Under outcome-based contracts: speed becomes a margin driver. Token optimization saves the delivery team money. Durable architecture reduces support load. Every incentive aligns with delivery quality.
The market data: seat/hourly AI pricing dropped from 21% to 15% of engagements in 2025. Outcome-based contracts surged from 27% to 41%.
One More Thing: The Compounding Data Moat
Every production AI deployment generates proprietary training signals, correction patterns, user interactions, and edge cases. These compound.
An enterprise that deployed in Q1 has 3 quarters of proprietary production data by Q4. A competitor still in planning cycles has none. That data gap doesn't close with a better model selection. It closes slowly, with earlier deployment.
The fastest path to closing it is shipping. This is the whole argument for Velocity PODs.
What's your current production AI stack?
Specifically curious what others are using for observability and hallucination detection in production.
LangSmith? Custom? Something else? Drop it in the comments.

