Python Programming Series

Originally published at python.plainenglish.io

Building a Hybrid-Private RAG Platform on AWS: From Prototype to Production with Python

“It works on my machine.”

In the world of Generative AI, these are the five most dangerous words a developer can utter. The journey from a local Jupyter Notebook running a RAG (Retrieval-Augmented Generation) prototype to a production-grade AI platform that handles thousands of concurrent users is where 90% of AI projects fail.

When you move to an enterprise context, two new monsters appear: Data Privacy and Scalability Costs. You can’t just send your company’s proprietary source code or sensitive financial data to public APIs without serious consideration. You need a Private RAG Platform: a self-hosted, secured, and elastic system that keeps your sensitive data within your infrastructure.

This article provides the architectural blueprint for building a hybrid-private RAG system on AWS using Python. We’ll not only design the application logic but also provision the necessary cloud infrastructure using Python itself, demonstrating the power of Infrastructure as Code with a Pythonic approach.

Understanding the Hybrid-Private Architecture

Before we dive in, let’s clarify what “hybrid-private” means. In enterprise AI systems, we often use a strategic tiered approach:

  • Public APIs for lightweight, non-sensitive tasks (routing, classification, general queries)
  • Self-hosted models for processing proprietary data and sensitive operations
  • Private vector databases for storing your knowledge base

This approach optimizes for both cost-efficiency and data privacy. You reserve expensive GPU compute for operations that truly require it, while using fast, cheap public APIs for tasks that don’t involve sensitive data.

For organizations requiring 100% air-gapped solutions, you would self-host all components — but the architectural patterns remain the same.
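
To make the tiering concrete, here is a minimal sketch of how such a policy could be expressed in configuration. The backends, endpoints, and model names are illustrative assumptions, not part of the reference architecture:

# Hypothetical routing policy: which backend handles which class of work.
# Every endpoint and model name below is a placeholder for your own services.
ROUTING_POLICY = {
    "classification": {            # lightweight, non-sensitive
        "backend": "public_api",
        "model": "gemini-flash-latest",
    },
    "private_generation": {        # touches proprietary data
        "backend": "self_hosted_vllm",
        "endpoint": "http://vllm.internal:8000/v1",
    },
    "retrieval": {                 # knowledge-base lookups
        "backend": "private_vector_db",
        "endpoint": "http://vector-db.internal:8000",
    },
}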


The Production RAG Architecture Blueprint

A production RAG platform is a distributed system, not a single script. To achieve enterprise standards, we must decouple the components into distinct, scalable layers:

  1. The Entry Point (API Gateway): A managed service that handles incoming traffic, authentication, and rate limiting.
  2. The Brain (Python Orchestrator): An asynchronous Python microservice that receives user queries, intelligently routes them, and executes the RAG pipeline.
  3. The Memory (Vector Database): A stateful service like Weaviate or ChromaDB, backed by persistent storage.
  4. The Inference Engine: Specialized engines like vLLM serving open-source LLMs, optimized for high throughput.

The architecture flow:

User → API Gateway → Python Orchestrator → Routing LLM → [RAG Pipeline with Vector DB] OR [Direct LLM] → Response


Part 1: The Application Logic

Let’s build the Python orchestrator using FastAPI. This service will be containerized and deployed to AWS.

The FastAPI Application (main.py)

This is the core orchestrator — the brain of our RAG platform. Its primary job is to receive user queries and intelligently route them to the most appropriate backend service, avoiding unnecessary costs and latency.

The orchestrator implements a two-tier routing strategy: it first uses a lightweight classifier model to analyze the user’s intent. If the query requires access to internal documents, engineering specs, or proprietary data, it triggers the expensive RAG pipeline (vector database search + context-aware LLM inference). If the query is general knowledge that doesn’t need private data, it routes directly to a simpler LLM endpoint, skipping the RAG overhead entirely.

For this demonstration, we use Gemini Flash via API for the routing classifier — a pragmatic choice that keeps the example simple and focused on the architectural pattern. In a production environment, you would replace this with a self-hosted small model (like Llama 3.2 1B or Mistral 7B) running on your AWS infrastructure to maintain complete data privacy. The beauty of this design is that the routing logic remains identical regardless of where the classification model is hosted — you simply swap the endpoint.
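
To illustrate that swap, here is a minimal sketch of the same router pointed at a self-hosted model behind vLLM, which exposes an OpenAI-compatible API. The host name and model are assumptions for the example:

from langchain_openai import ChatOpenAI

# Same routing logic, but the classifier now runs on private infrastructure.
# "vllm.internal" and the model name are placeholders for your own deployment.
router = ChatOpenAI(
    model="llama-3.2-1b-instruct",
    temperature=0.0,
    base_url="http://vllm.internal:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="unused",  # vLLM needs no real key unless you configure one
)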

This stateless, asynchronous FastAPI service is designed to be containerized and deployed to AWS Lambda or EKS, making it the perfect cloud-native microservice.

import os
import asyncio
from fastapi import FastAPI, HTTPException, status
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

# Configuration loaded from environment variables
ROUTING_MODEL = os.getenv("ROUTING_MODEL", "gemini-flash-latest")
ROUTING_API_KEY = os.getenv("GOOGLE_API_KEY")


# Pydantic models for strict API contracts
class QueryRequest(BaseModel):
    prompt: str = Field(..., min_length=10, description="The user query")


class QueryResponse(BaseModel):
    source: str
    content: str
    request_id: str


# Simulated backend services
async def execute_rag_pipeline(prompt: str) -> str:
    """
    Executes the RAG pipeline:
    1. Query the vector database for relevant documents
    2. Send context + query to self-hosted LLM
    3. Return the augmented response
    """
    print(f"[RAG] Processing: {prompt[:50]}...")
    await asyncio.sleep(1.5)
    return f"Based on your private documents: {prompt[:50]}..."


async def execute_direct_query(prompt: str) -> str:
    """
    Handles general queries without RAG.
    """
    print(f"[DIRECT] Processing: {prompt[:50]}...")
    await asyncio.sleep(0.5)
    return f"General answer: {prompt[:50]}..."


# Initialize FastAPI
app = FastAPI(
    title="Hybrid RAG Orchestrator",
    description="Intelligent routing between RAG and direct LLM queries",
    version="1.0.0"
)


# Core routing logic
@app.post("/api/v1/query", response_model=QueryResponse)
async def process_query(request: QueryRequest, request_id: str = "local-test"):
    """
    Intelligently routes user queries to the appropriate backend.
    """
    router = ChatOpenAI(
        model=ROUTING_MODEL,
        temperature=0.0,
        base_url="https://generativelanguage.googleapis.com/v1beta/",
        api_key=ROUTING_API_KEY
    )
    decision_prompt = (
        "You are a query classifier. Respond with ONLY one word:\n"
        "- 'RAG' if it requires internal documents or company data\n"
        "- 'DIRECT' if it's general knowledge\n\n"
        f"Query: {request.prompt}\n\nClassification:"
    )
    try:
        response = await router.ainvoke(decision_prompt)
        decision = response.content.strip().upper()
        if "RAG" in decision:
            result_content = await execute_rag_pipeline(request.prompt)
            return QueryResponse(
                source="RAG_PIPELINE",
                content=result_content,
                request_id=request_id
            )
        else:
            result_content = await execute_direct_query(request.prompt)
            return QueryResponse(
                source="DIRECT_LLM",
                content=result_content,
                request_id=request_id
            )
    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Processing error: {str(e)}"
        )
@app.get("/health")
def health_check():
    return {"status": "healthy", "service": "rag-orchestrator"}
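
Before containerizing anything, you can exercise the orchestrator locally: export GOOGLE_API_KEY, start the app with uvicorn main:app --port 8080, and hit it with a short script. Below is a minimal smoke test using httpx (already in the requirements); the prompt is just an example:

# smoke_test.py -- assumes the orchestrator is running locally on port 8080
import asyncio

import httpx

async def main() -> None:
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            "http://localhost:8080/api/v1/query",
            json={"prompt": "Summarize the architecture of our internal billing service"},
        )
        resp.raise_for_status()
        print(resp.json())  # expect source == "RAG_PIPELINE" for a query about internal docs

asyncio.run(main())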

Requirements File (requirements.txt)

fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
langchain-openai==0.0.2
httpx==0.25.2

Dockerfile
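
One caveat first: Lambda cannot invoke a plain uvicorn container on its own. The Dockerfile below therefore bundles the AWS Lambda Web Adapter extension (an assumption on my part about how to bridge the two; wrapping the FastAPI app with Mangum is a common alternative). With the adapter in place, the same image runs unchanged on Lambda, ECS, or EKS. If you deploy behind a named API Gateway stage (as the Pulumi code below does with "prod"), you may also need to strip the stage prefix, for example via the adapter's AWS_LWA_REMOVE_BASE_PATH setting.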

FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim
# AWS Lambda Web Adapter: lets this standard uvicorn container run on Lambda
# as well as ECS/EKS. Remove this line if you never target Lambda.
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.4 /lambda-adapter /opt/extensions/lambda-adapter
WORKDIR /app
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY main.py .
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

Part 2: Infrastructure as Code with Pulumi

Now let’s provision the AWS infrastructure using Python and Pulumi.

Infrastructure Code (__main__.py)

import json

import pulumi
import pulumi_aws as aws
import pulumi_awsx as awsx

config = pulumi.Config()
google_api_key = config.require_secret("google_api_key")
# Create ECR repository for Docker images
ecr_repo = awsx.ecr.Repository(
    "rag-orchestrator-repo",
    force_delete=True
)
# Build and push Docker image
image = awsx.ecr.Image(
    "rag-orchestrator-image",
    repository_url=ecr_repo.repository.repository_url,
    context="./app",
    platform="linux/amd64"
)
# IAM Role for Lambda execution
lambda_role = aws.iam.Role(
    "rag-lambda-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"}
        }]
    })
)
aws.iam.RolePolicyAttachment(
    "lambda-exec-policy",
    role=lambda_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
)
# Lambda function from container image
lambda_function = aws.lambda_.Function(
    "rag-orchestrator-function",
    role=lambda_role.arn,
    package_type="Image",
    image_uri=image.image_uri,
    memory_size=1024,
    timeout=30,
    environment={
        "variables": {
            "GOOGLE_API_KEY": google_api_key
        }
    }
)
# API Gateway v2 (HTTP API)
api = aws.apigatewayv2.Api(
    "rag-api",
    protocol_type="HTTP",
    description="RAG Orchestrator API"
)
integration = aws.apigatewayv2.Integration(
    "rag-api-integration",
    api_id=api.id,
    integration_type="AWS_PROXY",
    integration_uri=lambda_function.arn,
    payload_format_version="2.0"
)
route = aws.apigatewayv2.Route(
    "rag-api-route",
    api_id=api.id,
    route_key="POST /api/v1/query",
    target=integration.id.apply(lambda id: f"integrations/{id}")
)
stage = aws.apigatewayv2.Stage(
    "rag-api-stage",
    api_id=api.id,
    name="prod",
    auto_deploy=True
)
lambda_permission = aws.lambda_.Permission(
    "api-gateway-invoke-permission",
    action="lambda:InvokeFunction",
    function=lambda_function.name,
    principal="apigateway.amazonaws.com",
    source_arn=api.execution_arn.apply(lambda arn: f"{arn}/*/*")
)
pulumi.export("api_endpoint", api.api_endpoint.apply(lambda endpoint: f"{endpoint}/prod"))

Deployment Instructions

Prerequisites

  • AWS Account with credentials configured (aws configure)
  • Pulumi CLI installed (curl -fsSL https://get.pulumi.com | sh)
  • Docker installed and running
  • Python 3.11+ installed
  • Google API Key for Gemini

Project Structure

rag-orchestrator/
├── app/
│   ├── main.py
│   ├── requirements.txt
│   └── Dockerfile
├── __main__.py
├── Pulumi.yaml
└── requirements.txt
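
The deployment steps below also assume a Pulumi.yaml at the project root; its contents are not shown above, so here is a minimal sketch (the name and description are up to you):

name: rag-orchestrator
description: Hybrid-private RAG orchestrator on AWS
runtime:
  name: python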

Step-by-Step Deployment

# Create project directory
mkdir rag-orchestrator && cd rag-orchestrator
mkdir app

# Add all the files mentioned above into the correct directories

# Install Pulumi dependencies
pip install pulumi pulumi-aws pulumi-awsx

# Initialize Pulumi
pulumi login
pulumi stack init dev

# Set your Google API key (encrypted)
pulumi config set --secret google_api_key YOUR_KEY

# Deploy everything
pulumi up

# Test the endpoint
curl -X POST "$(pulumi stack output api_endpoint)/api/v1/query" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?"}'

# Cleanup when done
pulumi destroy

Beyond Serverless: The Kubernetes Path

While the serverless approach above is excellent for getting started, enterprise-grade systems eventually require Kubernetes (EKS) for:

  • Stateful Services: Your vector database needs StatefulSets with PersistentVolumes to ensure data durability.
  • High-Performance Inference: Serving large models (70B+ parameters) requires GPU-enabled node pools with specialized engines like vLLM.
  • Observability & AIOps: Production systems need Prometheus, Grafana, and AI-driven anomaly detection.

Key Takeaways

Building a production RAG platform requires more than just connecting an LLM to a vector database. You need:

  • Intelligent Routing to optimize costs and latency
  • Infrastructure as Code for reproducible deployments
  • Containerization for consistency across environments
  • Security Best Practices (secrets management, IAM roles)
  • Scalability Architecture (serverless → Kubernetes as needed)

Your next steps:

  • Replace the simulated RAG pipeline with real vector database queries (see the sketch after this list).
  • Add authentication and rate limiting to the API Gateway.
  • Implement monitoring and alerting.
  • Consider migrating to EKS for stateful components.
  • Add CI/CD pipelines for automated deployments.
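
As a starting point for the first of those steps, here is a hedged sketch of what execute_rag_pipeline might look like against a private Chroma server and a self-hosted model behind vLLM. The host names, collection name, and model are assumptions; substitute your own deployment details:

import chromadb
from langchain_openai import ChatOpenAI

async def execute_rag_pipeline(prompt: str) -> str:
    """Retrieve context from a private vector DB, then ask a self-hosted LLM."""
    # Hypothetical in-cluster endpoints; replace with your own services.
    chroma = chromadb.HttpClient(host="vector-db.internal", port=8000)
    collection = chroma.get_collection("company-docs")
    results = collection.query(query_texts=[prompt], n_results=4)
    context = "\n\n".join(results["documents"][0])

    llm = ChatOpenAI(
        model="llama-3.1-8b-instruct",            # placeholder self-hosted model
        base_url="http://vllm.internal:8000/v1",  # vLLM's OpenAI-compatible API
        api_key="unused",
        temperature=0.1,
    )
    answer = await llm.ainvoke(
        "Answer strictly from the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {prompt}"
    )
    return answer.content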

Stop prototyping. Start engineering.


Going Deeper: From Blueprint to Production

This article covers the foundational architecture, but there’s significantly more depth required for production deployments. The transition from a serverless Lambda function to a fully operationalized Kubernetes cluster involves substantial engineering work across multiple domains.

For those interested in the complete implementation details, the “Cloud-Native Python, DevOps & LLMOps” book provides comprehensive coverage of these advanced topics:

  • Infrastructure provisioning with complete Terraform and Pulumi implementations for AWS EKS clusters.
  • Stateful service deployment using Helm charts for vector databases like Weaviate and ChromaDB.
  • High-performance LLM serving with vLLM optimization techniques and GPU resource management.
  • Operational patterns including AIOps implementations and complete end-to-end project examples.

The book includes production-ready code, Kubernetes manifests, and architectural patterns that extend beyond the scope of an article format: https://www.amazon.com/dp/B0G6FPGCBJ

Explore the complete 8-volume “Python Programming Series” for a comprehensive journey from Python fundamentals to advanced AI deployment: https://www.amazon.com/dp/B0FTTQNXKG. Each book can be read as a standalone.

Tags: Python, LLMOps, AWS, RAG, GenerativeAI, CloudNative, MachineLearning, Pulumi
