Building Production-Ready Open Source AI Infrastructure: A Technical Guide
Over the past year, we've built and open sourced six production-grade AI infrastructure projects. This isn't toy code, and these aren't proofs of concept. These are systems handling millions of requests daily in production environments.
Here's what we learned building open source AI infrastructure that actually works.
The Six Projects
- llm-cost-optimization: 3-layer caching plus intelligent routing
- ai-safety-framework: 5-layer defense with 250 red team test cases
- production-rag: 6-stage pipeline with re-ranking and evaluation
- distributed-training: PyTorch DDP with NCCL tuning
- roi-first-ai: Business metric selection and deployment templates
- agentic-ai: Multi-agent orchestration framework
All repositories are at github.com/anilatambharii.
Why Open Source Our Production Code
Three reasons.
First, the AI infrastructure landscape is fragmented. Every team rebuilds the same patterns from scratch. LLM caching. RAG pipelines. Cost optimization. Agent orchestration. We've already solved these problems. Sharing the solutions helps the community.
Second, open source code is battle tested. When thousands of developers review, use, and contribute to your code, it gets better fast. Private code stays brittle. Public code gets hardened.
Third, hiring advantage. The best engineers want to work on code that matters. Open source contributions demonstrate technical credibility better than any interview.
Architecture Principle: Composition Over Configuration
Each project is a focused library, not a framework. You compose them together rather than configuring one monolithic system.
Bad approach: One repo with 47 configuration options trying to do everything.
Good approach: Six repos, each solving one problem well. Use what you need. Ignore what you don't.
Example using llm-cost-optimization and production-rag together:
from llm_cost_optimization import CachingLayer, ModelRouter
from production_rag import RAGPipeline, HybridRetriever

# Set up caching for LLM calls
cache = CachingLayer(
    semantic_cache_threshold=0.95,
    redis_url="redis://localhost:6379"
)

# Set up model routing based on query complexity
router = ModelRouter(
    models={
        "simple": "claude-haiku-4-5",
        "complex": "claude-sonnet-4-6"
    },
    complexity_threshold=0.7
)

# Set up RAG pipeline with hybrid retrieval
retriever = HybridRetriever(
    vector_weight=0.7,
    keyword_weight=0.3
)
rag = RAGPipeline(
    retriever=retriever,
    llm_cache=cache,
    llm_router=router
)

# Use them together
result = rag.query("What were Q2 financial results?")
Each component is independent. Each can be used standalone. Together they form a complete system.
Project Deep Dive: LLM Cost Optimization
This project reduced our LLM costs from $47K monthly to $2.8K monthly. 94% cost reduction. Same quality.
Three Layer Caching
Exact match cache catches identical queries. Redis key is SHA256 hash of prompt. Cache hit returns response instantly. No LLM call. Zero cost.
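import hashlib
from typing import Optional

class ExactMatchCache:
    def __init__(self, redis_client):
        # Assumes a redis-py client created with decode_responses=True,
        # so get() returns str rather than bytes.
        self.redis = redis_client

    def get(self, prompt: str) -> Optional[str]:
        # Key is the SHA-256 hash of the exact prompt text
        key = hashlib.sha256(prompt.encode()).hexdigest()
        return self.redis.get(f"exact:{key}")

    def set(self, prompt: str, response: str, ttl: int = 3600):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        self.redis.setex(f"exact:{key}", ttl, response)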
Hit rate: 23% of queries.
Semantic cache catches similar queries. Embed the prompt. Find nearest neighbors in vector DB. If similarity > threshold (0.95), return cached response.
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, vector_db, threshold=0.95):
        self.embed = embedding_model
        self.db = vector_db
        self.threshold = threshold

    def get(self, prompt: str) -> Optional[str]:
        embedding = self.embed(prompt)
        # Nearest neighbor only; reuse its response if it is close enough
        results = self.db.search(embedding, k=1)
        if results and results[0].score > self.threshold:
            return results[0].cached_response
        return None

    def set(self, prompt: str, response: str):
        embedding = self.embed(prompt)
        self.db.insert(embedding, cached_response=response)
Hit rate: 31% of queries not caught by exact match.
Prefix cache reuses computation for prompts with common prefixes. System prompt is usually identical. Few-shot examples are usually identical. Only the user query changes.
Anthropic's prompt caching API handles this automatically. Mark static parts as cacheable.
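import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the static system prompt as a cacheable prefix
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)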
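Combined hit rate: 73% of queries are served from cache. 27% hit the LLM. Cost reduced 73% from caching alone. (The exact and semantic layers on their own compose to about 47%, since 0.23 + 0.77 × 0.31 ≈ 0.47; the rest of the 73% presumably reflects prefix-cache hits, which still call the model but at a discounted input-token rate.)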
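The lookup order across the two response caches can be sketched as follows. This is a minimal illustration under assumptions: cached_completion and DictCache are hypothetical names, and promoting a semantic hit into the exact cache is a design choice of this sketch rather than something the article states. Any object with get/set methods like the classes above would fit.

```python
from typing import Callable, Dict, Optional, Tuple


class DictCache:
    """Toy in-memory cache, used only to exercise the lookup order."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, prompt: str) -> Optional[str]:
        return self._data.get(prompt)

    def set(self, prompt: str, response: str) -> None:
        self._data[prompt] = response


def cached_completion(
    prompt: str,
    exact,
    semantic,
    call_llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (response, source), checking the cheapest layer first."""
    hit = exact.get(prompt)
    if hit is not None:
        return hit, "exact"

    hit = semantic.get(prompt)
    if hit is not None:
        # Promote so the next identical prompt skips the embedding step
        exact.set(prompt, hit)
        return hit, "semantic"

    # Both layers missed: pay for one LLM call and fill both caches
    response = call_llm(prompt)
    exact.set(prompt, response)
    semantic.set(prompt, response)
    return response, "llm"
```

Returning the source alongside the response makes it straightforward to count hits per layer when measuring cache performance.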
Intelligent Model Routing
Not every query needs GPT-4 or Claude Opus. Simple queries work fine on Haiku. Moderately complex queries go to Sonnet. Only the hardest need Opus.
Routing strategy:
class ModelRouter:
    def route(self, query: str) -> str:
        complexity = self.calculate_complexity(query)
        if complexity < 0.3:
            return "claude-haiku-4-5"  # $0.25 per 1M tokens
        elif complexity < 0.7:
            return "claude-sonnet-4-6"  # $3 per 1M tokens
        else:
            return "claude-opus-4-6"  # $15 per 1M tokens

    def calculate_complexity(self, query: str) -> float:
        # Features: length, question marks, technical terms, etc.
        features = self.extract_features(query)
        # Assumes a scikit-learn-style binary classifier: predict_proba takes a
        # batch (2D input) and returns one [P(simple), P(complex)] row per sample.
        return self.classifier.predict_proba([features])[0][1]
Trained a simple classifier on 10K labeled examples. "What's the capital of France?" → Haiku. "Analyze this 50 page contract for liability clauses" → Opus.
Result: 89% of queries route to Haiku. 9% to Sonnet. 2% to Opus. Average cost per query drops 88%.
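As a rough check on these numbers, the blended price per 1M tokens follows directly from the routing shares and the per-tier prices quoted in the router code above. The sketch below assumes every query uses about the same number of tokens regardless of tier, and the baselines it compares against are assumptions, since the baseline behind the 88% figure is not stated here.

```python
from typing import Dict

# Figures as quoted in this article (USD per 1M tokens, share of traffic)
PRICE_PER_MTOK: Dict[str, float] = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
TRAFFIC_SHARE: Dict[str, float] = {"haiku": 0.89, "sonnet": 0.09, "opus": 0.02}


def blended_price(prices: Dict[str, float], shares: Dict[str, float]) -> float:
    """Traffic-weighted average price per 1M tokens."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(prices[model] * share for model, share in shares.items())


def reduction_vs(baseline_price: float, blended: float) -> float:
    """Fractional saving relative to sending all traffic to one model."""
    return 1.0 - blended / baseline_price


blended = blended_price(PRICE_PER_MTOK, TRAFFIC_SHARE)
print(f"blended: ${blended:.4f} per 1M tokens")
print(f"vs all-Sonnet: {reduction_vs(PRICE_PER_MTOK['sonnet'], blended):.1%}")
print(f"vs all-Opus:   {reduction_vs(PRICE_PER_MTOK['opus'], blended):.1%}")
```

The blended price works out to about $0.79 per 1M tokens. The saving depends heavily on which single-model baseline it is measured against, so it is worth recording that baseline alongside any headline percentage.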
Implementation Notes
Cache invalidation is the hard part. We invalidate based on TTL (1 hour default) and explicit updates. When source data changes, we flush related cache entries.
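One way to combine TTL expiry with explicit flushes is to tag each cache entry with the source documents it was derived from. The sketch below is an assumption about how that could look, not the project's implementation: TaggedTTLCache and its tag names are hypothetical, and it uses an in-memory dict where production code would use Redis.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


class TaggedTTLCache:
    """Entries expire after a TTL and can be flushed early by tag."""

    def __init__(self, default_ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: Dict[str, Tuple[float, str]] = {}
        self._keys_by_tag: Dict[str, Set[str]] = defaultdict(set)
        self._default_ttl = default_ttl
        self._clock = clock  # injectable so tests can control time

    def set(self, key: str, value: str, tags: Iterable[str] = (),
            ttl: Optional[float] = None) -> None:
        lifetime = self._default_ttl if ttl is None else ttl
        self._entries[key] = (self._clock() + lifetime, value)
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy expiry on read
            return None
        return value

    def invalidate_tag(self, tag: str) -> int:
        """Flush every entry derived from the given source; return the count."""
        removed = 0
        for key in self._keys_by_tag.pop(tag, set()):
            if self._entries.pop(key, None) is not None:
                removed += 1
        return removed
```

When a source document changes, calling invalidate_tag with that document's tag (for example "doc:42") drops only the responses built from it, while unrelated entries keep serving until their TTL runs out.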
Monitoring tracks hit rates, latency, cost per query. Dashboard shows cache performance in real time. Alerts fire when hit rate drops below threshold.
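The hit-rate alert can be illustrated with a sliding window over recent lookups. This is a plain-Python sketch under assumptions: HitRateMonitor, its window size, and its thresholds are hypothetical, and a real deployment would export these values through its metrics system instead of checking them in process.

```python
from collections import deque
from typing import Deque


class HitRateMonitor:
    """Tracks cache hit rate over the most recent lookups."""

    def __init__(self, window: int = 1000, alert_below: float = 0.5,
                 min_samples: int = 100) -> None:
        self._events: Deque[bool] = deque(maxlen=window)
        self._alert_below = alert_below
        self._min_samples = min_samples

    def record(self, hit: bool) -> None:
        # Oldest event drops out automatically once the window is full
        self._events.append(bool(hit))

    @property
    def hit_rate(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def should_alert(self) -> bool:
        # Require a minimum sample so a cold start does not page anyone
        enough_data = len(self._events) >= self._min_samples
        return enough_data and self.hit_rate < self._alert_below
```

The minimum-sample guard matters right after a deploy or a cache flush, when the hit rate is legitimately low for a short time.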
Gradual rollout started with 1% of traffic. Measured cache hit rate and accuracy. Ramped to 10%, 50%, 100% over 3 weeks.
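A common way to run a percentage rollout like this is deterministic hash bucketing, so the same request or user ID always lands on the same side as the percentage ramps. The sketch below assumes that approach; in_rollout and the salt value are hypothetical, and the article does not say how traffic was actually split.

```python
import hashlib


def in_rollout(request_id: str, percent: float, salt: str = "cache-rollout") -> bool:
    """Return True if this ID falls inside the first `percent` of buckets."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode()).digest()
    # Map the ID to one of 10,000 buckets (0.01% granularity)
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Ramp schedule from the rollout described above
for percent in (1, 10, 50, 100):
    enabled = sum(in_rollout(f"user-{i}", percent) for i in range(10_000))
    print(f"{percent:>3}% target -> {enabled} of 10000 IDs enabled")
```

Because the bucket depends only on the salt and the ID, every ID enabled at 1% stays enabled at 10%, 50%, and 100%, which keeps before/after comparisons clean during the ramp.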
Project Deep Dive: Production RAG
We increased RAG accuracy from 52% to 89% by fixing retrieval, not the LLM.
The 6-Stage Pipeline
Stage 1: Query Processing
Don't send raw user queries to vector DB. Expand with synonyms. Extract metadata. Generate context-aware embedding.
class QueryProcessor:
    def process(self, query: str) -> ProcessedQuery:
        # Extract metadata
        metadata = {
            "date_range": self.extract_date_range(query),
            "department": self.extract_department(query),
            "doc_type": self.extract_doc_type(query)
        }

        # Expand with synonyms
        expanded = self.expand_synonyms(query)

        # Generate embedding
        embedding = self.embed_model(expanded)

        return ProcessedQuery(
            original=query,
            expanded=expanded,
            embedding=embedding,
            metadata=metadata
        )
Stage 2: Vector Database Search
Cosine similarity threshold 0.85. Top-k 50 candidates (not 5, not 10). Use Pinecone with metadata filtering.
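results = index.query(
    vector=processed_query.embedding,
    top_k=50,
    # Return stored metadata so later stages can read title, date, department
    include_metadata=True,
    filter={
        "department": processed_query.metadata["department"],
        # Range filters need numeric metadata, e.g. dates stored as timestamps
        "date": {"$gte": processed_query.metadata["date_range"][0]}
    }
)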
Stage 3: Hybrid Search
Combine semantic search (70%) with keyword search (30%) using BM25.
class HybridRetriever:
    def retrieve(self, query: ProcessedQuery) -> List[Document]:
        # Vector search
        vector_results = self.vector_search(query, k=50)

        # Keyword search
        keyword_results = self.bm25_search(query.expanded, k=50)

        # Combine with weights
        combined = self.merge_results(
            vector_results,
            keyword_results,
            vector_weight=0.7,
            keyword_weight=0.3
        )
        return combined[:50]
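The merge step is where the 70/30 weighting is applied. One plausible way to implement it is weighted fusion of min-max normalized scores, sketched below. This is an assumption: the function here takes plain doc-ID-to-score dicts rather than the result objects used above, and the repository's merge_results may normalize or fuse differently.

```python
from typing import Dict, List, Tuple


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so cosine and BM25 values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def merge_results(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Fuse two score maps into one ranked list of (doc_id, fused_score)."""
    v = _min_max(vector_scores)
    k = _min_max(keyword_scores)
    fused = {
        # A document missing from one retriever contributes 0 for that side
        doc_id: vector_weight * v.get(doc_id, 0.0) + keyword_weight * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    # Highest fused score first; ties broken by doc ID for stable output
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Normalizing first matters because BM25 scores are unbounded while cosine similarity is not, so raw scores cannot be weighted directly.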
Stage 4: Re-ranking
This single stage improved accuracy by 23%. Use a cross-encoder to score each candidate against the actual query.
from typing import List

from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: List[Document]) -> List[Document]:
        # Score each doc against query
        pairs = [(query, doc.text) for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by score
        ranked = sorted(
            zip(documents, scores),
            key=lambda x: x[1],
            reverse=True
        )
        # Keep only the best 5 for the LLM
        return [doc for doc, score in ranked[:5]]
Top 50 candidates from hybrid search → Re-rank → Best 5 to LLM.
Stage 5: Context Assembly
Smart chunking with overlap. 512 token chunks with 50 token overlap. Include surrounding context. Add metadata.
def assemble_context(ranked_docs: List[Document]) -> str:
    context_parts = []
    for i, doc in enumerate(ranked_docs):
        # Number each source so the LLM can cite it
        context_parts.append(
            f"Source {i+1}: {doc.metadata['title']}\n"
            f"Date: {doc.metadata['date']}\n"
            f"Department: {doc.metadata['department']}\n"
            f"{doc.text}\n"
            "---"
        )
    return "\n".join(context_parts)
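The chunking scheme mentioned for this stage (512-token chunks with 50-token overlap) can be sketched as a sliding window. The function below is an illustration under assumptions: chunk_tokens is a hypothetical name, and it works on any pre-tokenized sequence, so the tokenizer itself is left out.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512,
                 overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows that share `overlap` tokens with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here
words = ("lorem ipsum " * 600).split()
chunks = chunk_tokens(words)
print(len(words), "tokens ->", len(chunks), "chunks")
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing roughly 10% more tokens.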
Stage 6: LLM Generation
Force grounded responses. System prompt enforces citation. User query includes assembled context.
system_prompt = """You are a helpful assistant. Use ONLY the provided context to answer questions.
If the context doesn't contain enough information, say "I don't have enough information to answer that question."
Always cite your sources using the Source number."""
user_prompt = f"""Context:
{assembled_context}
Question: {original_query}
Answer:"""
Results
Before: 52% answer accuracy. 3.8s latency. 31% hallucination rate.
After: 89% accuracy (+71% relative). 1.2s latency (faster!). 4% hallucination rate (-87% relative).
The insight: Don't optimize the LLM. Optimize the retrieval. GPT-4 with bad context = bad answers. Haiku with perfect context = great answers.
Making Projects Production Ready
Every project includes:
Comprehensive tests: Unit tests for every function. Integration tests for pipelines. End-to-end tests for workflows. 90%+ coverage.
Documentation: README with quick start. Detailed API docs. Architecture diagrams. Example notebooks.
Benchmarks: Performance metrics. Accuracy measurements. Cost comparisons. Real numbers, not claims.
Monitoring: Prometheus metrics. Logging. Error tracking. Observability built in.
Deployment: Docker containers. Kubernetes manifests. Terraform modules. Production ready deployment.
Contributing to Open Source AI
Our projects welcome contributions. Here's how to get started:
- Pick a project that interests you
- Read the CONTRIBUTING.md
- Check the issues for "good first issue" labels
- Submit a PR with tests and documentation
- Respond to review feedback
We review all PRs within 48 hours. Quality bar is high but we help contributors meet it.
Conclusion
Open source AI infrastructure should be production ready, not proof of concept. These six projects represent thousands of hours of real world testing and optimization.
Use them. Contribute to them. Build on them.
The code is at github.com/anilatambharii. Documentation is comprehensive. Examples are plentiful. Issues are welcome.
Let's build better AI infrastructure together.
About the Author
Anil Prasad is Head of Engineering at Ambharii Labs, recognized as one of "100 Most Influential AI Leaders in USA 2024." He builds production-scale AI and data systems for enterprise organizations. Connect on LinkedIn at linkedin.com/in/anilsprasad or visit ambharii.com.