RAG (Retrieval-Augmented Generation) is how enterprises are deploying LLMs without fine-tuning. But most tutorials stop at the demo stage. Production RAG is a different beast entirely.
Here's what production RAG actually requires — and how to build it on AWS.
## RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Cost | Data Freshness | Accuracy | Complexity |
|---|---|---|---|---|
| RAG | Medium | Real-time | High (with good retrieval) | Medium |
| Fine-Tuning | High | Static (retraining needed) | High | High |
| Prompt Engineering | Low | Static | Variable | Low |
## Architecture
The pipeline: Documents → Chunking → Embeddings → Vector Store → Query → Retrieval → LLM → Response.
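The ingestion half of that pipeline (documents → chunking) can be sketched as a sliding token window. This is a minimal sketch: whitespace-split "tokens" and the 512/50 defaults are illustrative assumptions, and a production pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens.

    NOTE: tokens are approximated by whitespace-split words here; swap in
    the embedding model's tokenizer for accurate token counts.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail of the document
    return chunks
```

Each chunk then gets embedded and written to the vector store; the overlap keeps sentences that straddle a chunk boundary retrievable from either side.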
## Python Implementation
```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
# Control-plane client (collection management); vector queries go through
# the collection's data-plane endpoint, not this client.
opensearch = boto3.client("opensearchserverless")

def query_knowledge_base(question: str, collection_id: str) -> str:
    # Generate an embedding for the question with Titan Text Embeddings v2
    embed_response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question}),
    )
    query_embedding = json.loads(embed_response["body"].read())["embedding"]

    # Search the OpenSearch vector store for the top-5 nearest chunks
    # (search_vectors is a helper that runs the k-NN query)
    results = search_vectors(query_embedding, collection_id, k=5)
    context = "\n".join(r["text"] for r in results)

    # Generate an answer grounded in the retrieved context
    prompt = f"""Based on the following context, answer the question.
Context: {context}
Question: {question}
Answer:"""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```
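The `search_vectors` helper is left undefined in the snippet above. One way to fill it in is with the opensearch-py client against the collection's data-plane endpoint. This is a hypothetical sketch: the `rag-chunks` index name and the `vector`/`text` field names are assumptions, and it takes the collection endpoint rather than the collection ID.

```python
def build_knn_query(embedding: list, k: int = 5) -> dict:
    """Build an OpenSearch k-NN query body.

    'vector' and 'text' are assumed field names for the index mapping.
    """
    return {
        "size": k,
        "query": {"knn": {"vector": {"vector": embedding, "k": k}}},
        "_source": ["text"],
    }

def search_vectors(embedding: list, collection_endpoint: str, k: int = 5) -> list:
    """Run the k-NN query against a hypothetical 'rag-chunks' index.

    opensearch-py is imported lazily so the query builder above stays
    dependency-free; production code would also wire up SigV4 auth.
    """
    from opensearchpy import OpenSearch, RequestsHttpConnection

    client = OpenSearch(hosts=[collection_endpoint],
                        connection_class=RequestsHttpConnection)
    hits = client.search(index="rag-chunks",
                         body=build_knn_query(embedding, k))["hits"]["hits"]
    return [{"text": h["_source"]["text"], "score": h["_score"]} for h in hits]
```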
## Hallucination Mitigation
- Chunk size matters: 512 tokens with a 50-token overlap is a common starting point, but tune it for your corpus
- Hybrid search: combine semantic search with keyword search (BM25)
- Citation grounding: force the model to cite the source chunks it used
- Confidence scoring: filter low-relevance retrievals (e.g. cosine similarity < 0.7)
- Guardrails: use Bedrock Guardrails to block off-topic responses
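The confidence-scoring item above is a few lines of code. A minimal sketch, assuming each retrieval result carries its chunk embedding (the `embedding` key and the 0.7 default are illustrative):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_retrievals(query_embedding: list, results: list,
                      threshold: float = 0.7) -> list:
    """Drop retrieved chunks that are only weakly related to the query,
    so they never reach the prompt and invite hallucination."""
    return [r for r in results
            if cosine_similarity(query_embedding, r["embedding"]) >= threshold]
```

If everything is filtered out, return "I don't know" instead of letting the model improvise from an empty context.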
## Open Source Modules
- terraform-aws-rag-pipeline — Complete RAG infrastructure
- terraform-aws-bedrock-platform — Bedrock foundation
- terraform-aws-bedrock-agents — Autonomous Bedrock agents
Full article with architecture diagrams: kogunlowo123.github.io
Kehinde Ogunlowo — Principal Multi-Cloud DevSecOps Architect at Citadel Cloud Management
Top comments (1)
Thanks for sharing!
For production grade, you can extend this with caching and context optimization.
- Caching: serve frequently repeated queries from Redis with a TTL instead of re-running retrieval and generation.
- Context optimization: as chats grow longer, trim the context passed to the LLM; libraries such as tiktoken help count and reduce tokens.
Other optimizations are possible too, but these would be the bare minimum.
Happy building!
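The caching suggestion above can be sketched as a thin wrapper around the query function. A hedged sketch: `cached_query` and its key scheme are hypothetical names, and `client` is anything with redis-style `get`/`setex` (e.g. `redis.Redis()`).

```python
import hashlib
import json

def cached_query(client, question: str, answer_fn, ttl_seconds: int = 3600) -> str:
    """Serve a repeated question from cache; fall back to the RAG pipeline
    (answer_fn) on a miss and store the answer with a TTL.

    NOTE: client is assumed to expose redis-py's get/setex interface.
    """
    # Hash the question so the key is fixed-length and safe for Redis
    key = "rag:" + hashlib.sha256(question.encode()).hexdigest()
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    answer = answer_fn(question)
    client.setex(key, ttl_seconds, json.dumps(answer))
    return answer
```

Exact-match caching only pays off for genuinely repeated questions; semantic caching (keying on the query embedding) is the next step up.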