RAG (Retrieval-Augmented Generation) is how enterprises are deploying LLMs without fine-tuning. But most tutorials stop at the demo stage. Production RAG is a different beast entirely.
Here's what production RAG actually requires — and how to build it on AWS.
## RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Cost | Data Freshness | Accuracy | Complexity |
|---|---|---|---|---|
| RAG | Medium | Real-time | High (with good retrieval) | Medium |
| Fine-Tuning | High | Static (retraining needed) | High | High |
| Prompt Engineering | Low | Static | Variable | Low |
## Architecture
The pipeline: Documents → Chunking → Embeddings → Vector Store → Query → Retrieval → LLM → Response.
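The ingestion half of that pipeline (documents → chunking) can be sketched as a sliding token window. This is a minimal sketch: whitespace-split "tokens" and the 512/50 defaults are illustrative assumptions, and a production pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens.

    NOTE: tokens are approximated by whitespace-split words here; swap in
    the embedding model's tokenizer for accurate token counts.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail of the document
    return chunks
```

Each chunk then gets embedded and written to the vector store; the overlap keeps sentences that straddle a chunk boundary retrievable from either side.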
## Python Implementation
```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
# Control-plane client (collection management); vector queries go through
# the collection's data-plane endpoint, not this client.
opensearch = boto3.client("opensearchserverless")

def query_knowledge_base(question: str, collection_id: str) -> str:
    # Generate an embedding for the question with Titan Text Embeddings v2
    embed_response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question}),
    )
    query_embedding = json.loads(embed_response["body"].read())["embedding"]

    # Search the OpenSearch vector store for the top-5 nearest chunks
    # (search_vectors is a helper that runs the k-NN query)
    results = search_vectors(query_embedding, collection_id, k=5)
    context = "\n".join(r["text"] for r in results)

    # Generate an answer grounded in the retrieved context
    prompt = f"""Based on the following context, answer the question.
Context: {context}
Question: {question}
Answer:"""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```
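The `search_vectors` helper is left undefined in the snippet above. One way to fill it in is with the opensearch-py client against the collection's data-plane endpoint. This is a hypothetical sketch: the `rag-chunks` index name and the `vector`/`text` field names are assumptions, and it takes the collection endpoint rather than the collection ID.

```python
def build_knn_query(embedding: list, k: int = 5) -> dict:
    """Build an OpenSearch k-NN query body.

    'vector' and 'text' are assumed field names for the index mapping.
    """
    return {
        "size": k,
        "query": {"knn": {"vector": {"vector": embedding, "k": k}}},
        "_source": ["text"],
    }

def search_vectors(embedding: list, collection_endpoint: str, k: int = 5) -> list:
    """Run the k-NN query against a hypothetical 'rag-chunks' index.

    opensearch-py is imported lazily so the query builder above stays
    dependency-free; production code would also wire up SigV4 auth.
    """
    from opensearchpy import OpenSearch, RequestsHttpConnection

    client = OpenSearch(hosts=[collection_endpoint],
                        connection_class=RequestsHttpConnection)
    hits = client.search(index="rag-chunks",
                         body=build_knn_query(embedding, k))["hits"]["hits"]
    return [{"text": h["_source"]["text"], "score": h["_score"]} for h in hits]
```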
## Hallucination Mitigation
- Chunk size matters: 512 tokens with a 50-token overlap is a common starting point, but tune it for your corpus
- Hybrid search: combine semantic search with keyword search (BM25)
- Citation grounding: force the model to cite the source chunks it used
- Confidence scoring: filter low-relevance retrievals (e.g. cosine similarity < 0.7)
- Guardrails: use Bedrock Guardrails to block off-topic responses
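The confidence-scoring item above is a few lines of code. A minimal sketch, assuming each retrieval result carries its chunk embedding (the `embedding` key and the 0.7 default are illustrative):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_retrievals(query_embedding: list, results: list,
                      threshold: float = 0.7) -> list:
    """Drop retrieved chunks that are only weakly related to the query,
    so they never reach the prompt and invite hallucination."""
    return [r for r in results
            if cosine_similarity(query_embedding, r["embedding"]) >= threshold]
```

If everything is filtered out, return "I don't know" instead of letting the model improvise from an empty context.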
## Open Source Modules
- terraform-aws-rag-pipeline — Complete RAG infrastructure
- terraform-aws-bedrock-platform — Bedrock foundation
- terraform-aws-bedrock-agents — Autonomous Bedrock agents
Full article with architecture diagrams: kogunlowo123.github.io
Kehinde Ogunlowo — Principal Multi-Cloud DevSecOps Architect at Citadel Cloud Management
Top comments (1)
Thanks for sharing!
For production grade, you can extend this with caching and context optimization.
- Caching: serve frequently repeated queries from Redis with a TTL instead of re-running retrieval and generation.
- Context optimization: as chats grow longer, trim the context passed to the LLM; libraries such as tiktoken help count and reduce tokens.
Other optimizations are possible too, but these would be the bare minimum.
Happy building!
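The caching suggestion above can be sketched as a thin wrapper around the query function. A hedged sketch: `cached_query` and its key scheme are hypothetical names, and `client` is anything with redis-style `get`/`setex` (e.g. `redis.Redis()`).

```python
import hashlib
import json

def cached_query(client, question: str, answer_fn, ttl_seconds: int = 3600) -> str:
    """Serve a repeated question from cache; fall back to the RAG pipeline
    (answer_fn) on a miss and store the answer with a TTL.

    NOTE: client is assumed to expose redis-py's get/setex interface.
    """
    # Hash the question so the key is fixed-length and safe for Redis
    key = "rag:" + hashlib.sha256(question.encode()).hexdigest()
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    answer = answer_fn(question)
    client.setex(key, ttl_seconds, json.dumps(answer))
    return answer
```

Exact-match caching only pays off for genuinely repeated questions; semantic caching (keying on the query embedding) is the next step up.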