“Anyone can build a demo. Shipping AI to production is a completely different sport.”
AI-powered web apps are everywhere right now—but most articles stop at “call the model and render the result.”
This post is about what happens after the demo works.
I’ll walk through how I design production-grade AI web systems today:
architecture, performance constraints, cost control, failure modes, and the mistakes I’ve already paid for—so you don’t have to.
This is not beginner content. If you’re building real systems, this is for you.
1. The Real AI Web Stack (Not the Blog-Tutorial Version)
A serious AI web application is not just:
React → API → LLM → Response
In production, the stack actually looks more like this:
Client (Web / Mobile)
   ↓
BFF (Backend-for-Frontend)
   ↓
AI Orchestrator Layer
   ├── Prompt Assembly
   ├── Context Retrieval (RAG)
   ├── Tool / Function Calling
   ├── Caching & Deduplication
   └── Cost Guards
   ↓
Model Providers (LLMs, Vision, Speech)
   ↓
Post-Processing & Validation
Key insight:
👉 LLMs should never be called directly from your core business API.
Treat AI like an unreliable but powerful subsystem, not a trusted function.
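To make that boundary concrete: the business API only ever talks to the orchestrator, never to a provider SDK. A minimal sketch, assuming Express and a hypothetical ./ai/orchestrator module that exposes the ai.run wrapper shown in the next section:

import express from "express";
import { ai } from "./ai/orchestrator"; // hypothetical module, see the ai.run example below

const app = express();
app.use(express.json());

app.post("/api/summaries", async (req, res) => {
  try {
    // No provider SDK imports here: the orchestrator owns prompts, retries, and cost guards.
    const result = await ai.run({ task: "summarize", input: req.body.text });
    res.json(result);
  } catch {
    // Degrade instead of leaking provider errors to the client.
    res.status(502).json({ error: "AI is temporarily unavailable" });
  }
});

app.listen(3000);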
2. Why You Need an AI Orchestrator (Even If You’re Solo)
The biggest mistake I see:
“We’ll refactor later once usage grows.”
You won’t. You’ll ship hacks into production and live with them.
The AI Orchestrator owns:
- Prompt versioning
- Input normalization
- Retry + fallback logic
- Model routing (cheap vs expensive)
- Safety filters
- Observability (tokens, latency, cost)
Even a thin orchestration layer saves you weeks later.
Example (simplified):
const response = await ai.run({
  task: "summarize",
  input,
  constraints: {
    maxTokens: 500,
    temperature: 0.3
  },
  fallbackModel: "gpt-4o-mini"
});
This abstraction is boring—until it’s the reason your app survives traffic.
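Even that thin wrapper earns its keep once it owns retries, fallback routing, and basic token/latency logging. Here is a minimal sketch of what could sit behind ai.run; buildPrompt and callModel are placeholders for your prompt templates and provider SDK, not a real library:

interface RunOptions {
  task: string;
  input: string;
  constraints?: { maxTokens?: number; temperature?: number };
  fallbackModel?: string;
}

function buildPrompt(task: string, input: string): string {
  // In a real system this loads a versioned prompt template.
  return `Task: ${task}\n\nInput:\n${input}`;
}

async function callModel(model: string, prompt: string, opts: RunOptions) {
  // Placeholder: call your provider SDK here and return text plus token usage.
  return { text: "", tokensUsed: 0 };
}

const ai = {
  async run(opts: RunOptions) {
    const prompt = buildPrompt(opts.task, opts.input);
    const models = ["gpt-4o", opts.fallbackModel ?? "gpt-4o-mini"]; // primary first, cheap fallback second

    for (const model of models) {
      const started = Date.now();
      try {
        const result = await callModel(model, prompt, opts);
        console.log({ model, latencyMs: Date.now() - started, tokens: result.tokensUsed });
        return result;
      } catch (err) {
        console.warn({ model, failedWith: String(err) }); // fall through to the next model
      }
    }
    throw new Error("All models failed");
  },
};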
3. Prompt Engineering Is a Software Engineering Problem
Prompts are code, whether we like it or not.
What breaks in real systems:
- Tiny wording changes causing regressions
- Model updates changing output shape
- Silent failures that “look” valid
What actually works:
- Typed outputs (JSON schemas)
- Prompt versioning
- Contract tests
Example prompt contract:
{
  "type": "object",
  "properties": {
    "summary": { "type": "string" },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
  },
  "required": ["summary", "confidence"]
}
If the model violates the contract → reject and retry.
Never blindly trust AI output in production.
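One way to enforce that contract is to validate every response before it reaches business logic and retry on violation. A minimal sketch using the zod validator (any schema library works; generate() stands in for whatever calls your model):

import { z } from "zod";

const SummarySchema = z.object({
  summary: z.string(),
  confidence: z.number().min(0).max(1),
});

// `generate` is a placeholder for your model call returning raw text.
async function summarizeWithContract(generate: () => Promise<string>, maxAttempts = 2) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate();
    try {
      const parsed = SummarySchema.safeParse(JSON.parse(raw));
      if (parsed.success) return parsed.data; // typed: { summary: string; confidence: number }
    } catch {
      // JSON.parse failed: treat it as a contract violation and retry
    }
  }
  throw new Error("Model output violated the contract after retries");
}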
4. RAG Is Not “Just Add a Vector DB”
Retrieval-Augmented Generation (RAG) is powerful—and widely misunderstood.
Common failures:
- Chunk sizes chosen randomly
- No metadata filtering
- Re-embedding the same content endlessly
- Treating similarity score as “truth”
What works better:
- Task-specific chunking
- Hybrid search (vector + keyword)
- Aggressive caching
- Domain-specific embeddings
Hard lesson:
The quality of your retrieved context matters more than the model.
A smaller model + clean context beats a massive model + noisy data every time.
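To make hybrid search and metadata filtering concrete, here is a schematic sketch. vectorSearch and keywordSearch are placeholders for whatever your vector DB and search index expose, and the 0.6/0.4 weights are illustrative, not tuned values:

interface Hit { id: string; text: string; score: number; source: string }

// Placeholders for your real vector DB and keyword index clients.
async function vectorSearch(query: string, topK: number): Promise<Hit[]> { return []; }
async function keywordSearch(query: string, topK: number): Promise<Hit[]> { return []; }

async function hybridRetrieve(query: string, allowedSources: string[]): Promise<Hit[]> {
  const [vec, kw] = await Promise.all([vectorSearch(query, 20), keywordSearch(query, 20)]);

  // Blend the two signals; similarity is a ranking hint, not ground truth.
  const scores = new Map<string, number>();
  const byId = new Map<string, Hit>();
  for (const h of vec) { scores.set(h.id, (scores.get(h.id) ?? 0) + 0.6 * h.score); byId.set(h.id, h); }
  for (const h of kw) { scores.set(h.id, (scores.get(h.id) ?? 0) + 0.4 * h.score); byId.set(h.id, h); }

  return [...byId.values()]
    .filter((h) => allowedSources.includes(h.source)) // metadata filtering
    .sort((a, b) => (scores.get(b.id) ?? 0) - (scores.get(a.id) ?? 0))
    .slice(0, 8); // only the best chunks go into the prompt
}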
5. Latency Is the Silent Killer
Your users don’t care how smart your AI is if it feels slow.
Practical techniques:
- Stream responses (always)
- Pre-warm embeddings
- Cache semantic intent, not raw text
- Parallelize retrieval + validation
Example mental model:
“If this were a database query, would I accept this latency?”
If not—optimize.
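"Stream responses (always)" in practice means sending tokens to the browser as they arrive instead of waiting for the full completion. A minimal server-sent events sketch with Express; streamCompletion stands in for your provider's streaming API:

import express from "express";

// Stand-in for a provider streaming API that yields text chunks.
async function* streamCompletion(prompt: string): AsyncGenerator<string> {
  for (const chunk of ["This ", "is ", "a ", "streamed ", "answer."]) yield chunk;
}

const app = express();

app.get("/api/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  for await (const chunk of streamCompletion(String(req.query.q ?? ""))) {
    res.write(`data: ${JSON.stringify({ chunk })}\n\n`); // one SSE event per chunk
  }
  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);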
6. Cost Control Is a Feature, Not an Afterthought
AI costs scale non-linearly with success.
What I now do by default:
- Token budgets per request
- Daily cost caps
- Model downgrades under load
- Hard limits for anonymous users
Rule of thumb:
If you don’t know your cost per request, you don’t have a business.
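A per-request token budget and a daily cap are only a few lines, and writing them forces you to learn your cost per request. A sketch with illustrative prices (check your provider's current rates):

// Illustrative per-1K-token prices; replace with your provider's actual rates.
const PRICE_PER_1K = { input: 0.005, output: 0.015 };

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * PRICE_PER_1K.input + (outputTokens / 1000) * PRICE_PER_1K.output;
}

const MAX_INPUT_TOKENS = 4000;
const DAILY_CAP_USD = 50;
let dailySpendUSD = 0; // reset by a daily job in a real system

function enforceBudget(inputTokens: number) {
  if (inputTokens > MAX_INPUT_TOKENS) throw new Error("Request exceeds token budget");
  if (dailySpendUSD >= DAILY_CAP_USD) throw new Error("Daily AI cost cap reached");
}

function recordUsage(inputTokens: number, outputTokens: number) {
  dailySpendUSD += estimateCostUSD(inputTokens, outputTokens);
}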
7. AI Failures Must Be Designed For
LLMs fail in creative ways:
- Confidently wrong answers
- Partial outputs
- Timeout hallucinations
- Format drift
Your UI should assume:
- “This might be wrong”
- “This might be slow”
- “This might fail silently”
Good AI UX is about graceful degradation, not perfection.
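In code, "this might be slow or fail" usually translates to a timeout plus an explicit degraded result that the UI knows how to render. A small sketch; the 8-second budget and the generateAnswer helper are illustrative:

type AiResult =
  | { status: "ok"; text: string }
  | { status: "degraded"; reason: "timeout" | "error" };

// `generateAnswer` stands in for your orchestrator call.
async function answerWithFallback(generateAnswer: () => Promise<string>): Promise<AiResult> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("timeout")), 8000)
  );

  try {
    const text = await Promise.race([generateAnswer(), timeout]);
    return { status: "ok", text };
  } catch (err) {
    // The UI can show a retry button or a non-AI fallback instead of an endless spinner.
    return { status: "degraded", reason: String(err).includes("timeout") ? "timeout" : "error" };
  }
}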
8. Observability: Log the Intent, Not Just the Error
Traditional logs are useless for AI systems.
What you should log:
- Prompt version
- Model used
- Token counts
- Latency
- Confidence scores
- User feedback signals

This turns “AI feels worse lately” into something debuggable.
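In practice that means one structured log event per AI call, not scattered console output. A minimal shape (field names are illustrative):

interface AiCallLog {
  promptVersion: string;   // e.g. "summarize@v7"
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  confidence?: number;     // only if the output contract returns one
  userFeedback?: "up" | "down";
}

function logAiCall(event: AiCallLog) {
  // Ship to your existing logging/metrics pipeline as structured JSON.
  console.log(JSON.stringify({ type: "ai_call", ...event }));
}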
9. What I’d Do Differently If I Started Again
Hard-earned lessons:
- Build orchestration first
- Treat prompts as code
- Optimize context, not models
- Add cost guards early
- Expect failure, design for it
AI doesn’t replace engineering discipline—it demands more of it.
Final Thought
We’re not in the “AI hype phase” anymore.
We’re in the phase where:
- Architecture matters again
- Engineering judgment beats novelty
- The winners ship reliable systems
If you’re building AI-powered web apps today, you’re not just writing code—you’re designing probabilistic software.
And that changes everything.
Top comments (1)
Hey Art! Great post and we were looking to do something very similar with Flywheel.
I learned this real fast when I kept getting non-200s back from Claude's API and found out they were in an outage, so now I have a check that reads their status page and caches the result. This way we can disable our AI functionality and inform our users why.
Curious what y'all are using to stream your messages? We're using SSE and it's been pretty great for us.
Another thing we haven't run into yet, but know we will, is token costs. Even testing is $$$.
Anyways, thanks for the post!