After 18 months of building LLM features, here are the architecture patterns I have seen work at scale, and the two that consistently fail.
Patterns That Scale
1. Prompt-as-a-Service
User Input → Prompt Template → LLM API → Response → User
Simple. Reliable. Easy to debug. Most LLM features should start here.
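A minimal sketch of this pattern. `call_llm` is a hypothetical stand-in for your provider's API client, and the template is illustrative:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real API call to your model provider.
    return f"[model response to: {prompt}]"

# The template is the only piece that changes per feature; it can be
# versioned and tested on its own.
TEMPLATE = "Summarize the following support ticket in one sentence:\n{ticket}"

def handle_request(user_input: str) -> str:
    # User Input -> Prompt Template -> LLM API -> Response -> User
    prompt = TEMPLATE.format(ticket=user_input)
    return call_llm(prompt)
```

Because the template is plain data, you can unit-test prompt construction without ever hitting the model.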
2. Retrieval-Augmented Generation (RAG)
Query → Vector Search → Context → Prompt → LLM → Response
Good for question answering, knowledge bases, anything requiring specific information.
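A toy version of the flow. The "vector search" here is a naive keyword-overlap scorer standing in for a real embedding index, and `call_llm` is a hypothetical stub:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"[answer grounded in: {prompt}]"

DOCS = [
    "Our refund window is 30 days from purchase.",
    "Support is available weekdays 9am to 5pm UTC.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by shared words with the query -- a stand-in for
    # cosine similarity over embeddings in a real vector store.
    q = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def answer(query: str) -> str:
    # Query -> Vector Search -> Context -> Prompt -> LLM -> Response
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)
```

Swapping the scorer for a real embedding index changes `retrieve` only; the rest of the pipeline is untouched.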
3. Agentic Workflows
Task → LLM Planning → Tool Calls → Review → Output
For complex tasks requiring multiple steps. More powerful but harder to debug.
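A deterministic sketch of the loop. The planner and tools below are hypothetical stubs; in a real system the LLM would produce the plan and choose the tools:

```python
def plan(task: str) -> list[dict]:
    # Stand-in planner: a real agent would ask the LLM for this plan.
    return [{"tool": "search", "arg": task}, {"tool": "summarize", "arg": task}]

TOOLS = {
    "search": lambda arg: f"results for '{arg}'",
    "summarize": lambda arg: f"summary of '{arg}'",
}

def run_agent(task: str) -> str:
    # Task -> LLM Planning -> Tool Calls -> Review -> Output
    outputs = []
    for step in plan(task):
        result = TOOLS[step["tool"]](step["arg"])
        # Review gate: fail loudly instead of passing bad tool output on.
        if not result:
            raise RuntimeError(f"tool {step['tool']} returned nothing")
        outputs.append(result)
    return "; ".join(outputs)
```

The review gate between each tool call is what makes these workflows debuggable at all: every step leaves an inspectable result.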
4. Caching Layer
Input → Cache Check → [HIT]  → Response
                    → [MISS] → LLM → Cache → Response
Reduces cost and latency for repeated queries. Essential at scale.
5. Human-in-the-Loop
LLM Output → Human Review → [APPROVE] → Output
                          → [REJECT]  → Retry
For high-stakes decisions. Expensive but necessary for compliance.
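The approve/reject loop can be sketched as below. `review` is a hypothetical callback that a real system would route to a reviewer UI; here it is a simple predicate, and `generate` is a model stub:

```python
def generate(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"draft reply for: {prompt}"

def with_review(prompt: str, review, max_retries: int = 3) -> str:
    # LLM Output -> Human Review -> [APPROVE] -> Output
    #                            -> [REJECT]  -> Retry
    for _ in range(max_retries):
        draft = generate(prompt)
        if review(draft):
            return draft
    raise RuntimeError("no draft approved after retries")
```

Bounding the retries matters: without `max_retries`, a reviewer who rejects everything turns the loop into an infinite cost sink.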
Patterns That Do Not Scale
1. Direct Database → LLM → Output
User → LLM → Database Write → Response
No validation. No review. A hallucinated output becomes a destructive write. Works at demo scale. Breaks in production.
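The fix is a validation gate between the model and the database. The schema check below is a minimal sketch; the field names and the dict-as-database are illustrative:

```python
import json

def validate(raw: str) -> dict:
    # Parse and whitelist fields BEFORE anything touches the database.
    record = json.loads(raw)
    allowed = {"name", "email"}
    extra = set(record) - allowed
    if extra:
        raise ValueError(f"unexpected fields: {extra}")
    return record

def safe_write(db: dict, raw_llm_output: str) -> None:
    record = validate(raw_llm_output)  # reject before the write, not after
    db[record["email"]] = record
```

Malformed or over-scoped model output now raises an error instead of silently mutating production data.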
2. Monolithic Prompt Engineering
Complex Prompt = System + Context + History + Constraints + Examples + ...
A 2000-token prompt that does everything. Impossible to test, debug, or version control.
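One way out of the monolith is to compose the prompt from small named parts that can be versioned and tested independently (the part names below are illustrative):

```python
# Each piece lives on its own, so a change to CONSTRAINTS diffs cleanly
# in version control and can be unit-tested without the rest.
SYSTEM = "You are a support assistant."
CONSTRAINTS = "Answer in at most two sentences."

def build_prompt(context: str, history: list[str], question: str) -> str:
    # Complex Prompt = System + Context + History + Constraints + Question,
    # but assembled from tested parts rather than one 2000-token blob.
    parts = [SYSTEM, f"Context: {context}", *history, CONSTRAINTS, question]
    return "\n".join(parts)
```

The assembled prompt is the same text either way; the difference is that each component now has an owner, a diff history, and its own tests.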
The Key Insight
LLM architecture is software architecture. The same principles apply: modularity, testing, versioning, observability.
If your LLM feature would fail a code review for a microservice, it will fail in production.
Building scalable LLM features? I write about what works in production. Follow along.