Anil Prasad

Posted on May 19

Building Production-Ready Open Source AI Infrastructure: A Technical Guide

#ai #opensource #machinelearning #datascience

Over the past year, we've built and open sourced six production-grade AI infrastructure projects. This isn't toy code or a set of proofs of concept. These are systems handling millions of requests daily in production environments.

Here's what we learned building open source AI infrastructure that actually works.

The Six Projects

llm-cost-optimization: 3-layer caching plus intelligent routing
ai-safety-framework: 5-layer defense with 250 red team test cases
production-rag: 6-stage pipeline with re-ranking and evaluation
distributed-training: PyTorch DDP with NCCL tuning
roi-first-ai: Business metric selection and deployment templates
agentic-ai: Multi-agent orchestration framework

All repositories are at github.com/anilatambharii

Why Open Source Our Production Code

Three reasons.

First, the AI infrastructure landscape is fragmented. Every team rebuilds the same patterns from scratch. LLM caching. RAG pipelines. Cost optimization. Agent orchestration. We've already solved these problems. Sharing the solutions helps the community.

Second, open source code is battle tested. When thousands of developers review, use, and contribute to your code, it gets better fast. Private code stays brittle. Public code gets hardened.

Third, hiring advantage. The best engineers want to work on code that matters. Open source contributions demonstrate technical credibility better than any interview.

Architecture Principle: Composition Over Configuration

Each project is a focused library, not a framework. You compose them together rather than configuring one monolithic system.

Bad approach: One repo with 47 configuration options trying to do everything.

Good approach: Six repos, each solving one problem well. Use what you need. Ignore what you don't.

Example using llm-cost-optimization and production-rag together:

from llm_cost_optimization import CachingLayer, ModelRouter
from production_rag import RAGPipeline, HybridRetriever

# Set up caching for LLM calls
cache = CachingLayer(
    semantic_cache_threshold=0.95,
    redis_url="redis://localhost:6379"
)

# Set up model routing based on query complexity
router = ModelRouter(
    models={
        "simple": "claude-haiku-4-5",
        "complex": "claude-sonnet-4-6"
    },
    complexity_threshold=0.7
)

# Set up RAG pipeline with hybrid retrieval
retriever = HybridRetriever(
    vector_weight=0.7,
    keyword_weight=0.3
)

rag = RAGPipeline(
    retriever=retriever,
    llm_cache=cache,
    llm_router=router
)

# Use them together
result = rag.query("What were Q2 financial results?")

Each component is independent. Each can be used standalone. Together they form a complete system.

Project Deep Dive: LLM Cost Optimization

This project reduced our LLM costs from $47K monthly to $2.8K monthly. 94% cost reduction. Same quality.

Three Layer Caching

Exact match cache catches identical queries. The Redis key is the SHA-256 hash of the prompt. A cache hit returns the stored response immediately. No LLM call. Zero cost.

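import hashlib
from typing import Optional

class ExactMatchCache:
    def __init__(self, redis_client):
        # Assumes a redis-py client created with decode_responses=True,
        # so get() returns str rather than bytes.
        self.redis = redis_client

    def _key(self, prompt: str) -> str:
        # SHA-256 of the prompt gives a fixed-length key
        return "exact:" + hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        return self.redis.get(self._key(prompt))

    def set(self, prompt: str, response: str, ttl: int = 3600):
        # Entries expire after ttl seconds (1 hour by default)
        self.redis.setex(self._key(prompt), ttl, response)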

Hit rate: 23% of queries.

Semantic cache catches similar queries. Embed the prompt. Find nearest neighbors in vector DB. If similarity > threshold (0.95), return cached response.

from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, vector_db, threshold=0.95):
        self.embed = embedding_model
        self.db = vector_db
        self.threshold = threshold

    def get(self, prompt: str) -> Optional[str]:
        embedding = self.embed(prompt)
        results = self.db.search(embedding, k=1)

        if results and results[0].score > self.threshold:
            return results[0].cached_response
        return None

    def set(self, prompt: str, response: str):
        embedding = self.embed(prompt)
        self.db.insert(embedding, cached_response=response)

Hit rate: 31% of queries not caught by exact match.
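The two response caches are checked in order, cheapest first, and the LLM is called only when both miss. A minimal sketch of that lookup order follows. `DictCache`, `cached_completion`, and the `call_llm` callable are hypothetical names used for illustration, not part of the llm-cost-optimization API; the in-memory `DictCache` stands in for the Redis-backed and vector-backed classes above.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    def get(self, prompt: str) -> Optional[str]: ...
    def set(self, prompt: str, response: str) -> None: ...


class DictCache:
    """In-memory stand-in for ExactMatchCache or SemanticCache (hypothetical)."""

    def __init__(self) -> None:
        self.store: dict[str, str] = {}

    def get(self, prompt: str) -> Optional[str]:
        return self.store.get(prompt)

    def set(self, prompt: str, response: str) -> None:
        self.store[prompt] = response


def cached_completion(
    prompt: str,
    layers: list[Cache],
    call_llm: Callable[[str], str],
) -> tuple[str, str]:
    """Return (response, source), where source is 'layer-N' or 'llm'."""
    for i, layer in enumerate(layers):
        hit = layer.get(prompt)
        if hit is not None:
            # Backfill the earlier (cheaper) layers that missed
            for earlier in layers[:i]:
                earlier.set(prompt, hit)
            return hit, f"layer-{i}"
    # Every layer missed: pay for one LLM call, then populate all layers
    response = call_llm(prompt)
    for layer in layers:
        layer.set(prompt, response)
    return response, "llm"
```

Returning the source alongside the response makes it straightforward to count hits per layer, which is what the hit-rate figures in this section are based on.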

Prefix cache reuses computation for prompts with common prefixes. System prompt is usually identical. Few-shot examples are usually identical. Only the user query changes.

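Anthropic's prompt caching API handles this once you mark the static parts as cacheable. Unlike the first two layers, the request still reaches the model; the cached prefix is billed at a reduced rate and processed faster.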

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)

Combined hit rate: 73% of queries are served from cache and 27% hit the LLM, so caching alone cuts cost by roughly 73%.
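Per-layer hit rates compose multiplicatively, because each layer only sees the queries the previous layers missed. The sketch below shows that arithmetic; `combined_hit_rate` is a hypothetical helper name. Applied to the two response-cache figures quoted above (23% and 31%), it gives about 47%. The text does not break down how the rest of the 73% combined figure is attributed, so treat this as a way to sanity-check your own measurements, not as a reconstruction of the numbers above.

```python
def combined_hit_rate(layer_rates: list[float]) -> float:
    """Fraction of queries answered by any layer.

    Each rate is measured on the queries that all earlier layers missed.
    """
    miss = 1.0
    for rate in layer_rates:
        miss *= 1.0 - rate
    return 1.0 - miss


# Exact match (23%) followed by semantic (31% of the remainder)
rate = combined_hit_rate([0.23, 0.31])
print(f"{rate:.1%}")  # 46.9%
```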

Intelligent Model Routing

Not every query needs GPT-4 or Claude Opus. Simple queries work fine on Haiku. Moderately complex queries go to Sonnet, and only the hardest ones need Opus.

Routing strategy:

class ModelRouter:
    def route(self, query: str) -> str:
        complexity = self.calculate_complexity(query)

        if complexity < 0.3:
            return "claude-haiku-4-5"  # $0.25 per 1M tokens
        elif complexity < 0.7:
            return "claude-sonnet-4-6"  # $3 per 1M tokens
        else:
            return "claude-opus-4-6"    # $15 per 1M tokens

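    def calculate_complexity(self, query: str) -> float:
        # Features: length, question marks, technical terms, etc.
        features = self.extract_features(query)
        # Assuming a scikit-learn-style classifier: predict_proba takes a 2D
        # batch and returns one row per sample; column 1 is P(complex).
        return self.classifier.predict_proba([features])[0][1]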

Trained a simple classifier on 10K labeled examples. "What's the capital of France?" → Haiku. "Analyze this 50 page contract for liability clauses" → Opus.
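The features named in the routing code (length, question marks, technical terms) can be computed with a few lines of standard-library Python. This is a sketch under assumptions: the `TECHNICAL_TERMS` list and the exact feature set are hypothetical, and the repository's actual `extract_features` may differ.

```python
import re

# Hypothetical keyword list; a real one would be built from your own query logs.
TECHNICAL_TERMS = {"contract", "liability", "clause", "analyze", "compare", "regression"}


def extract_features(query: str) -> list[float]:
    """Turn a query into [word_count, question_marks, technical_term_count]."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    return [
        float(len(words)),
        float(query.count("?")),
        float(sum(1 for w in words if w in TECHNICAL_TERMS)),
    ]


print(extract_features("What's the capital of France?"))
# [5.0, 1.0, 0.0]
print(extract_features("Analyze this 50 page contract for liability clauses"))
# [8.0, 0.0, 3.0]
```

A feature vector like this is what the classifier would consume; with labeled examples, any simple binary classifier that outputs a probability can play the role of `self.classifier`.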

Result: 89% of queries route to Haiku. 9% to Sonnet. 2% to Opus. Average cost per query drops 88%.
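To see how a traffic split translates into cost, you can compute a traffic-weighted price. The sketch below uses the per-1M-token prices from the code comments above and the reported split, and assumes queries on every tier use a similar number of tokens. The baseline that the 88% figure is measured against is not stated, so the baseline here is left as a parameter; `blended_price` and `savings_vs` are hypothetical helper names.

```python
# Per-1M-token prices as quoted in the routing code comments above.
PRICE = {"haiku": 0.25, "sonnet": 3.0, "opus": 15.0}
# Traffic split reported in the text.
SHARE = {"haiku": 0.89, "sonnet": 0.09, "opus": 0.02}


def blended_price(price: dict[str, float], share: dict[str, float]) -> float:
    """Traffic-weighted average price, assuming similar token counts per tier."""
    return sum(price[model] * share[model] for model in share)


def savings_vs(baseline_price: float, blended: float) -> float:
    """Fractional saving relative to sending all traffic to one baseline model."""
    return 1.0 - blended / baseline_price


blended = blended_price(PRICE, SHARE)
print(f"${blended:.4f} per 1M tokens")  # $0.7925 per 1M tokens
```

Calling `savings_vs` with the price of whichever model served all traffic before routing gives the percentage to compare against your own bill.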

Implementation Notes

Cache invalidation is the hard part. We invalidate based on TTL (1 hour default) and explicit updates. When source data changes, we flush related cache entries.
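One way to combine TTL expiry with explicit flushes is to tag each cache entry with the source documents it depends on, then drop everything under a tag when that source changes. The following is a minimal in-memory sketch of that idea; the class name, the tag scheme, and the injected clock are assumptions for illustration, and a production version would live in Redis or similar.

```python
import time
from typing import Callable, Optional


class TaggedTTLCache:
    """In-memory cache with per-entry expiry and tag-based flushing."""

    def __init__(self, ttl: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}  # key -> (expires_at, value)
        self._by_tag: dict[str, set[str]] = {}            # tag -> keys

    def set(self, key: str, value: str, tags: tuple[str, ...] = ()) -> None:
        self._entries[key] = (self.clock() + self.ttl, value)
        for tag in tags:
            self._by_tag.setdefault(tag, set()).add(key)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazy expiry on read
            return None
        return value

    def invalidate_tag(self, tag: str) -> int:
        """Drop every entry tagged with `tag`; returns how many were removed."""
        removed = 0
        for key in self._by_tag.pop(tag, set()):
            if self._entries.pop(key, None) is not None:
                removed += 1
        return removed
```

When a source document is updated, calling `invalidate_tag("doc:42")` (with whatever tag naming you choose) removes every cached answer that was built from it, while unrelated entries keep their normal TTL.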

Monitoring tracks hit rates, latency, cost per query. Dashboard shows cache performance in real time. Alerts fire when hit rate drops below threshold.
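The alerting rule can be sketched as a rolling hit rate compared against a threshold. This standard-library version is illustrative only: the class name, window size, and threshold are assumptions, and in a real deployment the same counters would typically be exported as metrics rather than held in process memory.

```python
from collections import deque


class HitRateMonitor:
    """Rolling cache hit rate over the last `window` lookups."""

    def __init__(self, window: int = 1000, alert_below: float = 0.5):
        self.events: deque = deque(maxlen=window)  # True for hit, False for miss
        self.alert_below = alert_below

    def record(self, hit: bool) -> None:
        self.events.append(hit)

    def hit_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise right after startup
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.hit_rate() < self.alert_below
```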

Gradual rollout started with 1% of traffic. Measured cache hit rate and accuracy. Ramped to 10%, 50%, 100% over 3 weeks.
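A percentage rollout like this is often implemented by hashing a stable identifier into buckets, so the same user stays in or out of the experiment as the percentage ramps up. The sketch below shows one such scheme; the function name, the salt, and the use of a user ID as the key are assumptions, not details from the projects.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "llm-cache-v1") -> bool:
    """Deterministically place a user in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Ramp by raising `percent`: 1 -> 10 -> 50 -> 100
use_cache = in_rollout("user-123", percent=10)
```

Because the bucket depends only on the salt and the ID, everyone included at 1% is still included at 10%, which keeps before-and-after comparisons clean.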

Project Deep Dive: Production RAG

We increased RAG accuracy from 52% to 89% by fixing retrieval, not the LLM.

The 6-Stage Pipeline

Stage 1: Query Processing

Don't send raw user queries to vector DB. Expand with synonyms. Extract metadata. Generate context-aware embedding.

class QueryProcessor:
    def process(self, query: str) -> ProcessedQuery:
        # Extract metadata
        metadata = {
            "date_range": self.extract_date_range(query),
            "department": self.extract_department(query),
            "doc_type": self.extract_doc_type(query)
        }

        # Expand with synonyms
        expanded = self.expand_synonyms(query)

        # Generate embedding
        embedding = self.embed_model(expanded)

        return ProcessedQuery(
            original=query,
            expanded=expanded,
            embedding=embedding,
            metadata=metadata
        )
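The `ProcessedQuery` container and the synonym-expansion step are not shown above. A minimal sketch of both, under stated assumptions: the field types, the `SYNONYMS` table, and the append-in-parentheses format are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical synonym table; in practice this might come from a domain glossary.
SYNONYMS = {"q2": ["second quarter"], "results": ["earnings"]}


@dataclass
class ProcessedQuery:
    original: str
    expanded: str
    embedding: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


def expand_synonyms(query: str, table: dict[str, list[str]] = SYNONYMS) -> str:
    """Append known synonyms so retrieval can match either wording."""
    extra: list[str] = []
    for word in query.lower().split():
        extra.extend(table.get(word.strip("?.,!"), []))
    return query if not extra else f"{query} ({' '.join(extra)})"


print(expand_synonyms("What were Q2 financial results?"))
# What were Q2 financial results? (second quarter earnings)
```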

Stage 2: Vector Database Search

Cosine similarity threshold 0.85. Top-k 50 candidates (not 5, not 10). Use Pinecone with metadata filtering.

results = index.query(
    vector=processed_query.embedding,
    top_k=50,
    filter={
        "department": processed_query.metadata["department"],
        "date": {"$gte": processed_query.metadata["date_range"][0]}
    }
)
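The query above requests 50 candidates but does not itself apply the 0.85 similarity threshold. One way to enforce it is to filter the returned matches client-side, as sketched below. Plain dicts with `id` and `score` keys stand in for the vector database's match objects, and treating the threshold as inclusive is an assumption.

```python
def filter_by_score(matches: list[dict], min_score: float = 0.85) -> list[dict]:
    """Keep matches at or above min_score, preserving the ranked order."""
    return [m for m in matches if m["score"] >= min_score]


candidates = [
    {"id": "doc-1", "score": 0.93},
    {"id": "doc-2", "score": 0.86},
    {"id": "doc-3", "score": 0.71},
]
print([m["id"] for m in filter_by_score(candidates)])  # ['doc-1', 'doc-2']
```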

Stage 3: Hybrid Search

Combine semantic search (70%) with keyword search (30%) using BM25.

class HybridRetriever:
    def retrieve(self, query: ProcessedQuery) -> List[Document]:
        # Vector search
        vector_results = self.vector_search(query, k=50)

        # Keyword search
        keyword_results = self.bm25_search(query.expanded, k=50)

        # Combine with weights
        combined = self.merge_results(
            vector_results, 
            keyword_results,
            vector_weight=0.7,
            keyword_weight=0.3
        )

        return combined[:50]
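The `merge_results` step is not shown above. Cosine similarities and BM25 scores live on different scales, so a common approach is to normalize each list before taking the weighted sum. The sketch below uses min-max normalization; this is one possible fusion method (reciprocal rank fusion is another), not necessarily the one production-rag uses, and it works on hypothetical `{doc_id: score}` dicts instead of `Document` objects.

```python
def _normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two rankers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def merge_results(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> list[str]:
    """Weighted sum of normalized scores; a doc missing from one list scores 0 there."""
    v, k = _normalize(vector_scores), _normalize(keyword_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + keyword_weight * k.get(doc_id, 0.0)
        for doc_id in v.keys() | k.keys()
    }
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(fused, key=lambda d: (-fused[d], d))


ranked = merge_results({"a": 0.9, "b": 0.8, "c": 0.7}, {"c": 12.0, "d": 4.0})
print(ranked)  # ['a', 'b', 'c', 'd']
```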

Stage 4: Re-ranking

This single stage improved accuracy by 23%. Use a cross-encoder to score each candidate against the actual query.

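from typing import List

# Assumes the sentence-transformers package is installed
from sentence_transformers import CrossEncoder

class Reranker: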
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: List[Document]) -> List[Document]:
        # Score each doc against query
        pairs = [(query, doc.text) for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by score
        ranked = sorted(
            zip(documents, scores),
            key=lambda x: x[1],
            reverse=True
        )

        return [doc for doc, score in ranked[:5]]

Top 50 candidates from hybrid search → Re-rank → Best 5 to LLM.

Stage 5: Context Assembly

Smart chunking with overlap. 512 token chunks with 50 token overlap. Include surrounding context. Add metadata.

def assemble_context(ranked_docs: List[Document]) -> str:
    context_parts = []

    for i, doc in enumerate(ranked_docs):
        context_parts.append(f"""
Source {i+1}: {doc.metadata['title']}
Date: {doc.metadata['date']}
Department: {doc.metadata['department']}

{doc.text}

---
        """)

    return "\n".join(context_parts)
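The 512-token chunks with 50-token overlap mentioned above can be produced with a sliding window. The sketch below operates on an already-tokenized list; which tokenizer produces that list is left open, and the function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens)
print(len(chunks))                          # 3
print(chunks[0][-50:] == chunks[1][:50])    # True
```

The overlap means a sentence that straddles a chunk boundary appears whole in at least one chunk, at the cost of indexing some tokens twice.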

Stage 6: LLM Generation

Force grounded responses. System prompt enforces citation. User query includes assembled context.

system_prompt = """You are a helpful assistant. Use ONLY the provided context to answer questions. 

If the context doesn't contain enough information, say "I don't have enough information to answer that question."

Always cite your sources using the Source number."""

user_prompt = f"""Context:
{assembled_context}

Question: {original_query}

Answer:"""

Results

Before: 52% answer accuracy. 3.8s latency. 31% hallucination rate.

After: 89% accuracy (+71% relative). 1.2s latency (faster!). 4% hallucination rate (-87% relative).

The insight: Don't optimize the LLM. Optimize the retrieval. GPT-4 with bad context = bad answers. Haiku with perfect context = great answers.

Making Projects Production Ready

Every project includes:

Comprehensive tests: Unit tests for every function. Integration tests for pipelines. End-to-end tests for workflows. 90%+ coverage.

Documentation: README with quick start. Detailed API docs. Architecture diagrams. Example notebooks.

Benchmarks: Performance metrics. Accuracy measurements. Cost comparisons. Real numbers, not claims.

Monitoring: Prometheus metrics. Logging. Error tracking. Observability built in.

Deployment: Docker containers. Kubernetes manifests. Terraform modules. Production ready deployment.

Contributing to Open Source AI

Our projects welcome contributions. Here's how to get started:

Pick a project that interests you
Read the CONTRIBUTING.md
Check the issues for "good first issue" labels
Submit a PR with tests and documentation
Respond to review feedback

We review all PRs within 48 hours. Quality bar is high but we help contributors meet it.

Conclusion

Open source AI infrastructure should be production ready, not proof of concept. These six projects represent thousands of hours of real world testing and optimization.

Use them. Contribute to them. Build on them.

The code is at github.com/anilatambharii. Documentation is comprehensive. Examples are plentiful. Issues are welcome.

Let's build better AI infrastructure together.

About the Author

Anil Prasad is Head of Engineering at Ambharii Labs, recognized as one of "100 Most Influential AI Leaders in USA 2024." He builds production-scale AI and data systems for enterprise organizations. Connect on LinkedIn at linkedin.com/in/anilsprasad or visit ambharii.com.
