DEV Community

Kehinde Ogunlowo

Building Production RAG Pipelines on AWS with Bedrock and OpenSearch

RAG (Retrieval-Augmented Generation) is how enterprises are deploying LLMs without fine-tuning. But most tutorials stop at the demo stage. Production RAG is a different beast entirely.

Here's what production RAG actually requires — and how to build it on AWS.

RAG vs Fine-Tuning vs Prompt Engineering

| Approach | Cost | Data Freshness | Accuracy | Complexity |
| --- | --- | --- | --- | --- |
| RAG | Medium | Real-time | High (with good retrieval) | Medium |
| Fine-Tuning | High | Static (retraining needed) | High | High |
| Prompt Engineering | Low | Static | Variable | Low |

Architecture

The pipeline: Documents → Chunking → Embeddings → Vector Store → Query → Retrieval → LLM → Response.
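The chunking step of this pipeline can be sketched as a simple sliding window with overlap. This is an illustrative word-based splitter (a production pipeline would count actual tokens with the embedding model's tokenizer, not words):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    Word-based approximation of token chunking; chunk_size and overlap
    mirror the 512/50 values discussed later in this post.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail
    return chunks
```

Each chunk would then be embedded (e.g. with Titan) and written to the vector store.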

Python Implementation

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def query_knowledge_base(question: str, collection_id: str) -> str:
    # Generate an embedding for the question with Titan
    embed_response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question})
    )
    query_embedding = json.loads(embed_response["body"].read())["embedding"]

    # Search the OpenSearch vector store; search_vectors is a helper
    # that runs a k-NN query against the collection
    results = search_vectors(query_embedding, collection_id, k=5)
    context = "\n".join([r["text"] for r in results])

    # Generate an answer grounded in the retrieved context
    prompt = f"""Based on the following context, answer the question.

Context: {context}

Question: {question}

Answer:"""

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024
        })
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```
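The `search_vectors` helper is left undefined above. One way to implement it, assuming the `opensearch-py` client and an index whose documents carry a `knn_vector` field named `embedding` and a `text` field (field names here are illustrative, not prescribed by the post):

```python
def build_knn_query(query_embedding: list[float], k: int = 5) -> dict:
    # Standard OpenSearch k-NN query body against a knn_vector field
    # named "embedding"; returns only the "text" field of each hit.
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_embedding, "k": k}}},
        "_source": ["text"],
    }

def search_vectors(query_embedding, index, k=5, client=None):
    """client: an opensearch-py OpenSearch client, assumed configured
    elsewhere (for Serverless, with SigV4 auth against the collection
    endpoint)."""
    response = client.search(index=index, body=build_knn_query(query_embedding, k))
    return [
        {"text": hit["_source"]["text"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]
```

Note that the `opensearchserverless` boto3 client only manages collections (control plane); queries go through the collection's data endpoint via a search client like this one.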

Hallucination Mitigation

  1. Chunk size matters — 512 tokens with 50-token overlap is a solid starting point (tune for your corpus)
  2. Hybrid search — combine semantic + keyword search (BM25)
  3. Citation grounding — force the model to cite source chunks
  4. Confidence scoring — filter low-relevance retrievals (cosine similarity < 0.7)
  5. Guardrails — use Bedrock Guardrails to block off-topic responses
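Point 4 above can be sketched directly: compute cosine similarity between the query embedding and each retrieved chunk's embedding, and drop anything below the threshold (0.7 here, per the list; the right cutoff depends on your embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_by_confidence(results, query_embedding, threshold=0.7):
    # Keep only retrievals similar enough to the query; feeding the LLM
    # low-relevance chunks is a common source of hallucination.
    return [
        r for r in results
        if cosine_similarity(r["embedding"], query_embedding) >= threshold
    ]
```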

Open Source Modules

Full article with architecture diagrams: kogunlowo123.github.io


Kehinde Ogunlowo — Principal Multi-Cloud DevSecOps Architect at Citadel Cloud Management


Top comments (1)

Varun S

thanks for sharing!

for production grade, you can look at extending this to include caching and context optimization.

caching - you can implement caching with redis to optimize frequently used queries with ttl.

context optimization - as the chats become longer, to optimize the context being passed to the llm, look at using libraries such as tiktoken to reduce the size of context.

there are other optimizations also possible but these would be the bare minimum.

happy building!
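The caching suggestion in the comment above can be sketched like this. An in-process dict stands in for Redis here so the idea is self-contained; in production you would swap it for `redis.Redis` with `SETEX`-style TTLs (class and key scheme are illustrative):

```python
import hashlib
import time

class QueryCache:
    """In-process stand-in for a Redis query cache with TTL."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, question: str) -> str:
        # Normalize so trivially different phrasings hit the same entry
        return hashlib.sha256(question.strip().lower().encode()).hexdigest()

    def get(self, question: str):
        entry = self._store.get(self._key(question))
        if entry and time.time() - entry["ts"] < self.ttl:
            return entry["answer"]
        return None  # miss or expired

    def set(self, question: str, answer: str):
        self._store[self._key(question)] = {"answer": answer, "ts": time.time()}
```

On a hit you skip both the embedding call and the LLM call, which is where most of the latency and cost sit.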