Amit Patriwala

Posted on Jun 27 • Originally published at Medium

PART 1 - How I Design Production-Ready LLM Infrastructure

Most LLM tutorials stop after making the first API call.

That's where the real work actually begins.

After building enterprise AI applications, I realized that the language model itself is only one component of a production system.

The real challenge is designing the infrastructure around it.

In this article, I'll share the architecture I use when designing production-ready LLM platforms.

The Architecture

Step 1 — Start with an API Gateway

Never expose your LLM directly.

Every request should first pass through an API Gateway responsible for:

Authentication
Rate limiting
Logging
Request validation
API versioning

Example technologies:

Azure API Management
Kong
NGINX
Envoy
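
In practice you would configure these checks in one of the gateways above rather than write them yourself. Still, a minimal Python sketch can make the responsibilities concrete. Everything here is an assumption for illustration: the `API_KEYS` store, the `MAX_PROMPT_CHARS` limit, the bucket sizes, and the `handle_request` function are hypothetical names, not part of any gateway product.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """Token bucket: `capacity` is the burst size, `refill_rate` is tokens per second."""
    capacity: float
    refill_rate: float
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical values, for illustration only.
API_KEYS = {"demo-key": "team-a"}
MAX_PROMPT_CHARS = 8000
_buckets: dict = {}


def handle_request(api_key: str, body: dict, now: Optional[float] = None) -> tuple:
    """Return (status_code, message) after auth, rate limiting, and validation."""
    client = API_KEYS.get(api_key)
    if client is None:
        return 401, "invalid API key"

    # One bucket per authenticated client.
    if client not in _buckets:
        _buckets[client] = TokenBucket(capacity=2, refill_rate=1.0)
    if not _buckets[client].allow(now):
        return 429, "rate limit exceeded"

    prompt = body.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, "prompt must be a non-empty string"
    if len(prompt) > MAX_PROMPT_CHARS:
        return 400, "prompt too long"
    return 200, "accepted"
```

The order matters in this sketch: unauthenticated callers are rejected before they can consume rate-limit budget, and validation runs last so malformed requests still count against the caller's quota.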

Step 2 — Add a Prompt Router

Not every request needs GPT-4.

Examples:

FAQ → Small model
Code generation → Coding model
Long reasoning → Large model
Internal documents → Local model

Routing each request to the smallest model that can handle it can significantly reduce inference costs.
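
Here is a rough sketch of a rule-based router for the four cases listed above. The model names, the keyword hints, and the length threshold are all placeholders I made up for illustration. A real router might use a small classifier model instead of keyword rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    model: str


# Placeholder model names; substitute whatever your platform actually serves.
ROUTES = {
    "faq": Route("faq", "small-model"),
    "code": Route("code", "coding-model"),
    "reasoning": Route("reasoning", "large-model"),
    "internal": Route("internal", "local-model"),
}

# Hypothetical heuristics, chosen only to keep the example short.
CODE_HINTS = ("def ", "class ", "function", "stack trace", "compile", "refactor")
LONG_PROMPT_CHARS = 2000


def route_prompt(prompt: str, source: Optional[str] = None) -> Route:
    """Pick a route using simple rules, falling back to the cheapest model."""
    text = prompt.lower()
    if source == "internal_docs":
        # Internal documents stay on the local model regardless of content.
        return ROUTES["internal"]
    if any(hint in text for hint in CODE_HINTS):
        return ROUTES["code"]
    if len(prompt) > LONG_PROMPT_CHARS or "step by step" in text:
        return ROUTES["reasoning"]
    return ROUTES["faq"]
```

The internal-documents check comes first on purpose: data-residency rules should win over cost or quality heuristics.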

Step 3 — Build a Dedicated Embedding Service

Don't generate embeddings inside your application.

Create a separate service responsible for:

Chunking
Metadata
Embeddings
Versioning

This makes re-indexing much easier later, for example when you change the embedding model or the chunking strategy.
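
As a minimal sketch of what such a service does, the code below chunks a document with overlap and tags every vector with a version label. The `toy_embed` function is a stand-in that hashes text into numbers so the example runs without external dependencies. It is not a real embedding model, and `EMBEDDING_VERSION`, `EmbeddedChunk`, and `embed_document` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

EMBEDDING_VERSION = "toy-hash-v1"  # hypothetical label; bump it when model or chunking changes


@dataclass
class EmbeddedChunk:
    doc_id: str
    chunk_index: int
    text: str
    vector: list
    embedding_version: str


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += size - overlap
    return chunks


def toy_embed(text: str, dims: int = 8) -> list:
    """Placeholder embedding: NOT semantic, just deterministic numbers from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]


def embed_document(
    doc_id: str,
    text: str,
    embed: Callable[[str], list] = toy_embed,
    version: str = EMBEDDING_VERSION,
) -> list:
    """Chunk a document and attach metadata plus a version to every vector."""
    return [
        EmbeddedChunk(doc_id, index, chunk, embed(chunk), version)
        for index, chunk in enumerate(chunk_text(text))
    ]
```

Storing the version next to each vector is what makes re-indexing manageable: you can find every record produced by an old configuration and re-embed only those.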

Step 4 — Store Vectors

Popular choices include:

Qdrant
pgvector
Azure AI Search
Pinecone
Weaviate

Choose based on scale and operational needs.
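
Whichever product you pick, the core operations are the same: upsert a vector with metadata, then search by similarity. The in-memory sketch below shows that contract using brute-force cosine similarity. It is an illustration only; `InMemoryVectorStore` is a hypothetical class, and the databases above use approximate indexes rather than a linear scan.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Brute-force store for illustration; fine for tests, not for production scale."""

    def __init__(self) -> None:
        self._items = {}  # item_id -> (vector, metadata)

    def upsert(self, item_id: str, vector: list, metadata: dict) -> None:
        self._items[item_id] = (vector, metadata)

    def search(self, query: list, top_k: int = 3) -> list:
        """Return up to `top_k` (item_id, score) pairs, best match first."""
        scored = [
            (item_id, cosine_similarity(query, vector))
            for item_id, (vector, _metadata) in self._items.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

Keeping application code behind a small interface like `upsert` and `search` also makes it less painful to switch vector stores later.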

Step 5 — Add an LLM Gateway

Instead of calling OpenAI directly from your application, send every model call through a gateway:

Application
↓
LLM Gateway
↓
OpenAI / Claude / Local Models

Benefits include:

Provider abstraction
Retry logic
Failover
Usage tracking
Cost reporting
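
A minimal sketch of the retry, failover, and usage-tracking parts might look like this. Providers are modeled as plain callables so the example runs without any SDK. `LLMGateway`, `AllProvidersFailed`, and the retry settings are hypothetical, and a real gateway would retry only on transient errors such as timeouts or rate limits rather than on every exception.

```python
import time
from typing import Callable

# A provider is any function that takes a prompt and returns a completion.
Provider = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every provider has exhausted its retries."""


class LLMGateway:
    def __init__(
        self,
        providers: list,  # ordered list of (name, Provider); first is preferred
        max_retries: int = 2,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.providers = providers
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep
        self.calls = {}  # provider name -> number of successful calls

    def complete(self, prompt: str) -> str:
        errors = []
        for name, call in self.providers:
            for attempt in range(self.max_retries + 1):
                try:
                    result = call(prompt)
                except Exception as exc:  # sketch only: narrow this in real code
                    errors.append(f"{name} attempt {attempt + 1}: {exc}")
                    if attempt < self.max_retries:
                        # Exponential backoff before retrying the same provider.
                        self.sleep(self.base_delay * 2 ** attempt)
                else:
                    self.calls[name] = self.calls.get(name, 0) + 1
                    return result
            # Retries exhausted: fall through to the next provider (failover).
        raise AllProvidersFailed("; ".join(errors))
```

Because the application only ever calls `complete`, swapping or reordering providers becomes a configuration change instead of a code change.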

Step 6 — Never Skip Observability

Track:

Latency
Token usage
Cost
Prompt failures
Cache hit rate
Retrieval quality

Without these metrics, you cannot see where latency and cost are going, so optimizing your AI platform becomes guesswork.
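
To show how a few of these metrics relate, here is a small sketch that records one entry per request and rolls them up. The prices in `PRICE_PER_1K_TOKENS` are invented numbers, and `RequestMetric` and `summarize` are hypothetical names. In a real system you would export these values to your monitoring stack instead of computing them in process.

```python
import math
from dataclasses import dataclass

# Made-up prices per 1,000 tokens, for illustration only.
PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}


@dataclass
class RequestMetric:
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    cache_hit: bool
    failed: bool = False


def summarize(metrics: list) -> dict:
    """Aggregate per-request metrics into platform-level numbers."""
    if not metrics:
        return {"requests": 0}
    count = len(metrics)
    latencies = sorted(m.latency_ms for m in metrics)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * count) - 1)
    total_tokens = sum(m.prompt_tokens + m.completion_tokens for m in metrics)
    cost = sum(
        (m.prompt_tokens + m.completion_tokens) / 1000
        * PRICE_PER_1K_TOKENS.get(m.model, 0.0)
        for m in metrics
    )
    return {
        "requests": count,
        "p95_latency_ms": latencies[p95_index],
        "total_tokens": total_tokens,
        "estimated_cost": cost,
        "cache_hit_rate": sum(m.cache_hit for m in metrics) / count,
        "failure_rate": sum(m.failed for m in metrics) / count,
    }
```

Retrieval quality is left out of this sketch because it usually needs labeled examples or human review, not just request logs.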

Common Mistakes

I often see teams making these mistakes:

❌ Hardcoding OpenAI calls

❌ No prompt routing

❌ No monitoring

❌ No caching

❌ Embeddings mixed into business logic

These choices may work for prototypes but usually become painful in production.
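
Caching is the cheapest of these to fix, so here is a minimal sketch of an exact-match prompt cache with a time-to-live. `PromptCache` is a hypothetical class, and collapsing whitespace before hashing is my own assumption about what counts as "the same prompt". Semantic caching, which matches similar rather than identical prompts, is a separate technique not shown here.

```python
import hashlib
import time
from typing import Callable, Optional


class PromptCache:
    """Exact-match cache keyed on (model, whitespace-normalized prompt)."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        normalized = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        key = self._key(model, prompt)
        entry = self._store.get(key)
        if entry is not None:
            expires_at, response = entry
            if self.clock() < expires_at:
                self.hits += 1
                return response
            del self._store[key]  # expired entry
        self.misses += 1
        return None

    def put(self, model: str, prompt: str, response: str) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self._store[self._key(model, prompt)] = (expires_at, response)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` property feeds directly into the cache hit rate metric from Step 6, which tells you whether the cache is earning its keep.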

My Recommended Production Stack

API Gateway
Authentication
Prompt Router
Prompt Cache
Embedding Service
Vector Database
LLM Gateway
Monitoring
Logging

Keeping these responsibilities separate makes the platform easier to maintain and evolve.

Final Thoughts

The LLM is only one part of the system.

The infrastructure around it determines whether your AI application is scalable, secure, and maintainable.

How are you designing your production AI stack?

I'd be interested to hear what components you've found essential, or which ones you wish you'd added sooner.

DEV Community