After 18 months of building LLM features, here are the architecture patterns I have seen work at scale, and the two that consistently fail.
Patterns That Scale
1. Prompt-as-a-Service
User Input → Prompt Template → LLM API → Response → User
Simple. Reliable. Easy to debug. Most LLM features should start here.
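A minimal sketch of this pattern. `call_llm` is a hypothetical stand-in for your provider's API client, and the template is illustrative:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real API call to your model provider.
    return f"[model response to: {prompt}]"

# The template is the only piece that changes per feature; it can be
# versioned and tested on its own.
TEMPLATE = "Summarize the following support ticket in one sentence:\n{ticket}"

def handle_request(user_input: str) -> str:
    # User Input -> Prompt Template -> LLM API -> Response -> User
    prompt = TEMPLATE.format(ticket=user_input)
    return call_llm(prompt)
```

Because the template is plain data, you can unit-test prompt construction without ever hitting the model.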
2. Retrieval-Augmented Generation (RAG)
Query → Vector Search → Context → Prompt → LLM → Response
Good for question answering, knowledge bases, anything requiring specific information.
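A toy version of the flow. The "vector search" here is a naive keyword-overlap scorer standing in for a real embedding index, and `call_llm` is a hypothetical stub:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"[answer grounded in: {prompt}]"

DOCS = [
    "Our refund window is 30 days from purchase.",
    "Support is available weekdays 9am to 5pm UTC.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by shared words with the query -- a stand-in for
    # cosine similarity over embeddings in a real vector store.
    q = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def answer(query: str) -> str:
    # Query -> Vector Search -> Context -> Prompt -> LLM -> Response
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)
```

Swapping the scorer for a real embedding index changes `retrieve` only; the rest of the pipeline is untouched.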
3. Agentic Workflows
Task → LLM Planning → Tool Calls → Review → Output
For complex tasks requiring multiple steps. More powerful but harder to debug.
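A deterministic sketch of the loop. The planner and tools below are hypothetical stubs; in a real system the LLM would produce the plan and choose the tools:

```python
def plan(task: str) -> list[dict]:
    # Stand-in planner: a real agent would ask the LLM for this plan.
    return [{"tool": "search", "arg": task}, {"tool": "summarize", "arg": task}]

TOOLS = {
    "search": lambda arg: f"results for '{arg}'",
    "summarize": lambda arg: f"summary of '{arg}'",
}

def run_agent(task: str) -> str:
    # Task -> LLM Planning -> Tool Calls -> Review -> Output
    outputs = []
    for step in plan(task):
        result = TOOLS[step["tool"]](step["arg"])
        # Review gate: fail loudly instead of passing bad tool output on.
        if not result:
            raise RuntimeError(f"tool {step['tool']} returned nothing")
        outputs.append(result)
    return "; ".join(outputs)
```

The review gate between each tool call is what makes these workflows debuggable at all: every step leaves an inspectable result.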
4. Caching Layer
Input → Cache Check → [HIT]  → Response
                    → [MISS] → LLM → Cache → Response
Reduces cost and latency for repeated queries. Essential at scale.
5. Human-in-the-Loop
LLM Output → Human Review → [APPROVE] → Output
                          → [REJECT]  → Retry
For high-stakes decisions. Expensive but necessary for compliance.
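The approve/reject loop can be sketched as below. `review` is a hypothetical callback that a real system would route to a reviewer UI; here it is a simple predicate, and `generate` is a model stub:

```python
def generate(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"draft reply for: {prompt}"

def with_review(prompt: str, review, max_retries: int = 3) -> str:
    # LLM Output -> Human Review -> [APPROVE] -> Output
    #                            -> [REJECT]  -> Retry
    for _ in range(max_retries):
        draft = generate(prompt)
        if review(draft):
            return draft
    raise RuntimeError("no draft approved after retries")
```

Bounding the retries matters: without `max_retries`, a reviewer who rejects everything turns the loop into an infinite cost sink.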
Patterns That Do Not Scale
1. Direct Database → LLM → Output
User → LLM → Database Write → Response
No validation. No review. A hallucinated output becomes a destructive write. Works at demo scale. Breaks in production.
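The fix is a validation gate between the model and the database. The schema check below is a minimal sketch; the field names and the dict-as-database are illustrative:

```python
import json

def validate(raw: str) -> dict:
    # Parse and whitelist fields BEFORE anything touches the database.
    record = json.loads(raw)
    allowed = {"name", "email"}
    extra = set(record) - allowed
    if extra:
        raise ValueError(f"unexpected fields: {extra}")
    return record

def safe_write(db: dict, raw_llm_output: str) -> None:
    record = validate(raw_llm_output)  # reject before the write, not after
    db[record["email"]] = record
```

Malformed or over-scoped model output now raises an error instead of silently mutating production data.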
2. Monolithic Prompt Engineering
Complex Prompt = System + Context + History + Constraints + Examples + ...
A 2000-token prompt that does everything. Impossible to test, debug, or version control.
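One way out of the monolith is to compose the prompt from small named parts that can be versioned and tested independently (the part names below are illustrative):

```python
# Each piece lives on its own, so a change to CONSTRAINTS diffs cleanly
# in version control and can be unit-tested without the rest.
SYSTEM = "You are a support assistant."
CONSTRAINTS = "Answer in at most two sentences."

def build_prompt(context: str, history: list[str], question: str) -> str:
    # Complex Prompt = System + Context + History + Constraints + Question,
    # but assembled from tested parts rather than one 2000-token blob.
    parts = [SYSTEM, f"Context: {context}", *history, CONSTRAINTS, question]
    return "\n".join(parts)
```

The assembled prompt is the same text either way; the difference is that each component now has an owner, a diff history, and its own tests.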
The Key Insight
LLM architecture is software architecture. The same principles apply: modularity, testing, versioning, observability.
If your LLM feature would fail a code review for a microservice, it will fail in production.
Building scalable LLM features? I write about what works in production. Follow along.