📚 Tech Acronyms Reference
Quick reference for acronyms used in this article:
- API - Application Programming Interface
- BERT - Bidirectional Encoder Representations from Transformers
- FAISS - Facebook AI Similarity Search
- GPU - Graphics Processing Unit
- JSON - JavaScript Object Notation
- LLM - Large Language Model
- RAG - Retrieval-Augmented Generation
- ROI - Return on Investment
- SQL - Structured Query Language
- VRAM - Video Random Access Memory
🎯 Introduction: The Knowledge Problem
Large Language Models (LLMs) have a fundamental limitation: their knowledge is frozen at training time.
Ask GPT-4 about:
- "What did our Q3 sales look like?" β β Doesn't know your data
- "What's in our employee handbook?" β β Doesn't have your docs
- "Show me tickets from yesterday" β β No real-time access
- "What did the customer say in ticket #45632?" β β Can't see your database
The LLM has no knowledge of YOUR specific data.
Three solutions exist:
- Fine-tuning: Retrain the model on your data (expensive, slow, static)
- Long context: Put everything in the prompt (limited by context window, expensive)
- RAG: Retrieve relevant data, then generate response (flexible, scalable, cost-effective)
This article is about RAG - the most practical approach for production systems.
💡 Data Engineer's ROI Lens
For this article, we're focusing on:
- What is RAG? (Architecture and workflow)
- How do I implement it? (Complete working code)
- When should I use RAG vs alternatives? (Decision framework)
RAG is the foundation for connecting LLMs to proprietary data at scale.
🏗️ Part 1: RAG Architecture
The Three-Stage Pipeline
Real-Life Analogy: The Research Assistant
Imagine you hire a research assistant to answer questions about your company:
Stage 1 - Indexing (Preparation):
- Assistant reads all company documents
- Creates organized notes with key topics
- Files everything for quick retrieval
Stage 2 - Retrieval (Finding Relevant Info):
- You ask: "What's our return policy?"
- Assistant searches their notes
- Pulls out the 3 most relevant documents
Stage 3 - Generation (Answering):
- Assistant reads those 3 documents
- Formulates an answer based on what they found
- Responds to your question
RAG works the same way.
The RAG Workflow
┌───────────────────────────┐
│     INDEXING (Offline)    │
└───────────────────────────┘

  Documents → Chunking → Embeddings → Vector Database

  "handbook.pdf"      Split into        Create vector
  "policies.docx"  →  paragraphs    →   representations   →  Vector DB
  "faqs.md"           (chunks)          (embeddings)

                          ↓

┌───────────────────────────┐
│   RETRIEVAL (Query Time)  │
└───────────────────────────┘

  User Query → Embed Query → Search Vector DB → Top-K

  "What's the       Create vector         Find similar       Get the 5
   return       →   representation    →   chunks         →   most
   policy?"         of the question       (cosine sim)       relevant

                          ↓

┌───────────────────────────┐
│   GENERATION (Response)   │
└───────────────────────────┘

  Retrieved Docs + Query → LLM → Final Answer

  Context: [5 relevant chunks      Send to      "Our return policy
  about returns]               →   GPT-4    →    allows returns
  Question: "return policy?"                     within 30 days..."
💻 Part 2: Building Your First RAG System
Step 1: Setup and Installation
pip install langchain
pip install chromadb # Vector database
pip install sentence-transformers # Embeddings
pip install litellm # LLM interface
pip install pypdf # PDF processing
Step 2: Document Loading and Chunking
from typing import List
import re
def load_documents(file_paths: List[str]) -> List[str]:
"""Load documents from files"""
documents = []
for path in file_paths:
with open(path, 'r', encoding='utf-8') as f:
content = f.read()
documents.append(content)
return documents
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
"""
Split text into overlapping chunks.
Args:
text: Input text to chunk
chunk_size: Target size of each chunk in characters
overlap: Number of characters to overlap between chunks
"""
    # Simple sentence-aware chunking; strip stray leading/trailing whitespace first
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_length = len(sentence)
# If adding this sentence exceeds chunk_size, save current chunk
if current_size + sentence_length > chunk_size and current_chunk:
            chunks.append(' '.join(current_chunk))
# Start new chunk with overlap
# Keep last few sentences for context
overlap_sentences = []
overlap_size = 0
for s in reversed(current_chunk):
if overlap_size + len(s) < overlap:
overlap_sentences.insert(0, s)
overlap_size += len(s)
else:
break
current_chunk = overlap_sentences
current_size = overlap_size
current_chunk.append(sentence)
current_size += sentence_length
# Add final chunk
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Test chunking
sample_text = """
Our return policy allows returns within 30 days of purchase.
Items must be in original condition with tags attached.
Refunds are processed within 5-7 business days.
For exchanges, we offer free shipping on the replacement item.
Gift returns require the original gift receipt.
Sale items are final sale and cannot be returned.
"""
chunks = chunk_text(sample_text, chunk_size=120, overlap=70)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}\n")
Output:
Chunk 1: Our return policy allows returns within 30 days of purchase. Items must be in original condition with tags attached.
Chunk 2: Items must be in original condition with tags attached. Refunds are processed within 5-7 business days.
Chunk 3: Refunds are processed within 5-7 business days. For exchanges, we offer free shipping on the replacement item.
Chunk 4: For exchanges, we offer free shipping on the replacement item. Gift returns require the original gift receipt.
Chunk 5: Gift returns require the original gift receipt. Sale items are final sale and cannot be returned.
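Step 1 already installs langchain, and its text splitter is a battle-tested alternative to rolling your own chunker. A minimal sketch, assuming the classic langchain.text_splitter import path (newer releases move it to the langchain_text_splitters package):

# Sketch: the same chunking task using LangChain's splitter instead of the
# custom chunk_text() above. Import path may vary by langchain version.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,        # target chunk size in characters
    chunk_overlap=70,      # overlap between consecutive chunks
    separators=["\n\n", "\n", ". ", " "]  # prefer splitting at larger boundaries
)
lc_chunks = splitter.split_text(sample_text)
for i, chunk in enumerate(lc_chunks):
    print(f"Chunk {i+1}: {chunk}\n")

The chunks will differ slightly from the hand-rolled version, but the idea is the same: roughly fixed-size chunks with enough overlap to preserve context across boundaries.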
Step 3: Create Embeddings and Vector Database
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
class VectorStore:
"""Simple vector database wrapper"""
def __init__(self, collection_name: str = "documents"):
# Initialize embedding model
self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Initialize ChromaDB
self.client = chromadb.Client(Settings(
anonymized_telemetry=False
))
        # Create the collection if it doesn't already exist, otherwise reuse it
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )
def add_documents(self, texts: List[str], metadata: List[dict] = None):
"""Add documents to vector store"""
# Generate embeddings
embeddings = self.embedding_model.encode(texts).tolist()
        # Generate IDs, offset by the current count so repeated calls don't collide
        start = self.collection.count()
        ids = [f"doc_{start + i}" for i in range(len(texts))]
# Add to collection
self.collection.add(
embeddings=embeddings,
documents=texts,
ids=ids,
metadatas=metadata if metadata else [{}] * len(texts)
)
print(f"Added {len(texts)} documents to vector store")
def search(self, query: str, top_k: int = 5) -> List[dict]:
"""Search for similar documents"""
# Embed query
query_embedding = self.embedding_model.encode([query]).tolist()
# Search
results = self.collection.query(
query_embeddings=query_embedding,
n_results=top_k
)
# Format results
documents = []
for i in range(len(results['documents'][0])):
documents.append({
'text': results['documents'][0][i],
'distance': results['distances'][0][i],
'metadata': results['metadatas'][0][i]
})
return documents
# Example usage
vector_store = VectorStore(collection_name="company_docs")
# Sample company documents
documents = [
"Our return policy allows returns within 30 days of purchase with original receipt.",
"Shipping is free for orders over $50. Standard shipping takes 3-5 business days.",
"We offer 24/7 customer support via phone, email, and live chat.",
"All products come with a 1-year manufacturer warranty covering defects.",
"International shipping is available to over 100 countries worldwide.",
"Our price match guarantee ensures you get the best deal within 14 days of purchase."
]
# Add documents
vector_store.add_documents(documents)
# Search
query = "How long do I have to return something?"
results = vector_store.search(query, top_k=3)
print(f"\nQuery: {query}\n")
for i, result in enumerate(results):
print(f"Result {i+1} (distance: {result['distance']:.3f}):")
print(f"{result['text']}\n")
Output:
Added 6 documents to vector store
Query: How long do I have to return something?
Result 1 (distance: 0.312):
Our return policy allows returns within 30 days of purchase with original receipt.
Result 2 (distance: 0.689):
Our price match guarantee ensures you get the best deal within 14 days of purchase.
Result 3 (distance: 0.724):
All products come with a 1-year manufacturer warranty covering defects.
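One caveat: chromadb.Client() keeps everything in memory, so the index disappears when the process exits. For anything beyond a quick experiment you probably want on-disk persistence. A minimal sketch, assuming ChromaDB's PersistentClient API (the path is illustrative):

# Sketch: persist the index to disk so it survives restarts.
# Assumes chromadb >= 0.4 with the PersistentClient API; path is illustrative.
import chromadb

persistent_client = chromadb.PersistentClient(path="./chroma_db")
persistent_collection = persistent_client.get_or_create_collection(
    name="company_docs",
    metadata={"hnsw:space": "cosine"}
)

Swapping this client into the VectorStore constructor means the indexing stage only has to run when documents actually change.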
Step 4: Complete RAG Pipeline
from litellm import completion
class RAGSystem:
"""Complete RAG system"""
def __init__(self, vector_store: VectorStore, model: str = "gpt-4"):
self.vector_store = vector_store
self.model = model
def query(
self,
question: str,
top_k: int = 5,
temperature: float = 0.0
) -> dict:
"""
Query the RAG system.
Returns:
dict with 'answer', 'sources', and 'retrieved_docs'
"""
# Step 1: Retrieve relevant documents
retrieved_docs = self.vector_store.search(question, top_k=top_k)
# Step 2: Build context from retrieved documents
context = "\n\n".join([
f"[Document {i+1}]\n{doc['text']}"
for i, doc in enumerate(retrieved_docs)
])
# Step 3: Create prompt with context
prompt = f"""Answer the question based on the context below. If the answer is not in the context, say "I don't have enough information to answer that."
Context:
{context}
Question: {question}
Answer:"""
# Step 4: Generate response
response = completion(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature
)
answer = response.choices[0].message.content
return {
'answer': answer,
'sources': [doc['text'] for doc in retrieved_docs],
'retrieved_docs': retrieved_docs,
'context': context
}
def query_with_citation(self, question: str, top_k: int = 5) -> str:
"""Query with inline citations"""
result = self.query(question, top_k=top_k)
# Build answer with citations
answer = result['answer']
sources = result['sources']
response = f"{answer}\n\nSources:\n"
for i, source in enumerate(sources[:3]): # Show top 3 sources
response += f"{i+1}. {source}\n"
return response
# Create RAG system
rag = RAGSystem(vector_store, model="gpt-4")
# Test queries
queries = [
"What's your return policy?",
"Do you offer international shipping?",
"How can I contact customer support?",
"What about warranties?"
]
for query in queries:
print(f"{'='*60}")
print(f"Q: {query}")
print(f"{'='*60}")
result = rag.query(query, top_k=3)
print(f"A: {result['answer']}")
print(f"\nRetrieved {len(result['retrieved_docs'])} relevant documents\n")
Output:
============================================================
Q: What's your return policy?
============================================================
A: Our return policy allows returns within 30 days of purchase, provided you have the original receipt.
Retrieved 3 relevant documents
============================================================
Q: Do you offer international shipping?
============================================================
A: Yes, international shipping is available to over 100 countries worldwide.
Retrieved 3 relevant documents
============================================================
Q: How can I contact customer support?
============================================================
A: We offer 24/7 customer support through phone, email, and live chat.
Retrieved 3 relevant documents
============================================================
Q: What about warranties?
============================================================
A: All products come with a 1-year manufacturer warranty that covers defects.
Retrieved 3 relevant documents
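Before adding features, it's worth checking that retrieval is actually surfacing the right chunks, since the LLM can only be as good as its context. A minimal hit-rate check over a small hand-labelled set (the queries and expected phrases below are illustrative; in practice build 50-100 pairs):

# Sketch: measure retrieval hit rate on a small labelled evaluation set.
# Each entry pairs a query with a phrase the top-k results should contain.
eval_set = [
    ("How long do I have to return something?", "within 30 days"),
    ("Is shipping free?", "free for orders over $50"),
    ("Do you ship abroad?", "international shipping"),
]

hits = 0
for question, expected_phrase in eval_set:
    retrieved = vector_store.search(question, top_k=3)
    if any(expected_phrase.lower() in doc['text'].lower() for doc in retrieved):
        hits += 1

print(f"Retrieval hit rate: {hits}/{len(eval_set)} = {hits / len(eval_set):.0%}")

If the hit rate is low, fix retrieval first (chunking, embedding model, top-k) before touching the prompt or the LLM.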
Step 5: Advanced RAG with Metadata Filtering
class AdvancedRAGSystem(RAGSystem):
"""RAG with metadata filtering"""
def query_with_filters(
self,
question: str,
filters: dict = None,
top_k: int = 5
) -> dict:
"""
Query with metadata filters.
filters example: {"category": "returns", "department": "sales"}
"""
# Search with filters (ChromaDB syntax)
query_embedding = self.vector_store.embedding_model.encode([question]).tolist()
where_clause = filters if filters else None
results = self.vector_store.collection.query(
query_embeddings=query_embedding,
n_results=top_k,
where=where_clause
)
# Format retrieved docs
retrieved_docs = []
for i in range(len(results['documents'][0])):
retrieved_docs.append({
'text': results['documents'][0][i],
'distance': results['distances'][0][i],
'metadata': results['metadatas'][0][i]
})
# Build context
context = "\n\n".join([
f"[Document {i+1}]\n{doc['text']}"
for i, doc in enumerate(retrieved_docs)
])
# Generate
prompt = f"""Answer based on the context. If not in context, say you don't know.
Context:
{context}
Question: {question}
Answer:"""
response = completion(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0.0
)
return {
'answer': response.choices[0].message.content,
'sources': [doc['text'] for doc in retrieved_docs],
'retrieved_docs': retrieved_docs
}
# Add documents with metadata
documents_with_metadata = [
("Returns are accepted within 30 days with receipt.", {"category": "returns", "department": "sales"}),
("Exchanges are free within 60 days of purchase.", {"category": "returns", "department": "sales"}),
("Technical support available 24/7 via phone.", {"category": "support", "department": "technical"}),
("Shipping is free over $50 within USA.", {"category": "shipping", "department": "logistics"}),
]
# Create new vector store with metadata
vector_store_meta = VectorStore(collection_name="docs_with_metadata")
texts = [doc[0] for doc in documents_with_metadata]
metadata = [doc[1] for doc in documents_with_metadata]
vector_store_meta.add_documents(texts, metadata)
# Query with filters
advanced_rag = AdvancedRAGSystem(vector_store_meta)
result = advanced_rag.query_with_filters(
question="What's the policy on returns?",
filters={"category": "returns"}, # Only search returns category
top_k=3
)
print(f"Answer: {result['answer']}")
print(f"\nSources found (filtered to 'returns' category only):")
for source in result['sources']:
print(f"- {source}")
Output:
Answer: Returns are accepted within 30 days with the original receipt. Additionally, exchanges are free within 60 days of purchase.
Sources found (filtered to 'returns' category only):
- Returns are accepted within 30 days with receipt.
- Exchanges are free within 60 days of purchase.
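ChromaDB's where clause also supports combining conditions. A hedged sketch using the $and operator (verify the filter syntax against the ChromaDB version you run):

# Sketch: combine multiple metadata conditions with ChromaDB's $and operator.
# Verify the operator syntax against your ChromaDB version.
result = advanced_rag.query_with_filters(
    question="When can I exchange an item?",
    filters={"$and": [
        {"category": "returns"},
        {"department": "sales"}
    ]},
    top_k=2
)
print(f"Answer: {result['answer']}")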
⚖️ Part 3: RAG vs Alternatives
Decision Framework
START
  │
  ├─ Do you need the model to "know" new information?
  │    ├─ NO  → Use base LLM (no RAG needed)
  │    └─ YES → Continue
  │
  ├─ Does the information change frequently?
  │    ├─ YES   → RAG (dynamic, real-time updates)
  │    └─ MAYBE → Continue
  │
  ├─ Is the information private/proprietary?
  │    ├─ YES → RAG or Fine-tuning (don't put in training data)
  │    └─ NO  → Continue
  │
  ├─ How much data?
  │    ├─ <10 docs → Long context (put in prompt)
  │    ├─ 10-10,000 docs → RAG
  │    └─ >10,000 docs or specialized domain → RAG + possible fine-tuning
  │
  ├─ Do you need the model to change its behavior/style?
  │    ├─ YES → Fine-tuning
  │    └─ NO  → RAG
  │
  └─ END
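The same framework, expressed as a small helper function you can drop into a design doc or notebook. The thresholds mirror the flowchart above and are rough heuristics, not hard rules:

# Sketch: the decision flowchart as a function. Thresholds are the rough
# heuristics from the flowchart, not hard rules.
def choose_approach(needs_new_info: bool,
                    changes_frequently: bool,
                    num_docs: int,
                    needs_behavior_change: bool) -> str:
    if not needs_new_info:
        return "Base LLM"                       # model already knows enough
    if needs_behavior_change:
        return "Fine-tuning (possibly + RAG)"   # style/behavior needs training
    if num_docs < 10 and not changes_frequently:
        return "Long context"                   # just put it in the prompt
    if num_docs > 10_000:
        return "RAG + possible fine-tuning"     # large or specialized corpus
    return "RAG"

print(choose_approach(True, True, 500, False))   # -> RAG
print(choose_approach(True, False, 3, False))    # -> Long context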
Comparison Table
| Approach | Best For | Cost | Update Speed | Complexity |
|---|---|---|---|---|
| Base LLM | General knowledge already in model | $ | N/A | Low |
| Long Context | <10 documents, static info | $$ | Instant | Low |
| RAG | 10-10K docs, frequently updated | $$$ | Real-time | Medium |
| Fine-tuning | Specialized domain, behavior changes | $$$$ | Slow (retrain) | High |
| RAG + Fine-tuning | Large-scale specialized systems | $$$$$ | Mixed | High |
When to Use RAG
✅ Use RAG when:
- Information updates frequently (daily/weekly)
- You have 10+ documents but <1M documents
- Need to cite sources (show where answer came from)
- Data is proprietary (can't put in training data)
- Want to control what model can access
- Need real-time information
- Budget-conscious (cheaper than fine-tuning)
❌ Don't use RAG when:
- Information fits in one prompt (use long context)
- Need to change model behavior/style (use fine-tuning)
- Need sub-millisecond response (caching might help)
- Only have 1-2 documents (just put in prompt)
Real-World Example: Customer Support System
Scenario: E-commerce company, 500 help articles, updated weekly
Option 1: Long Context
- Put all 500 articles in every prompt
- Cost: 500 articles × 500 tokens = 250K tokens per query
- At $0.01/1K tokens: $2.50 per query
- 10K queries/day = $25,000/day = $750K/month ❌
Option 2: RAG
- Index 500 articles once
- Retrieve top 5 relevant articles per query
- Cost: 5 articles × 500 tokens = 2.5K tokens per query
- At $0.01/1K tokens: $0.025 per query
- 10K queries/day = $250/day = $7,500/month ✅
RAG is 100x cheaper for this use case.
Option 3: Fine-tuning
- Train model on all 500 articles
- Cost: $1,000-5,000 initial training
- Must retrain weekly (articles update)
- Annual cost: $50K-250K ❌
- Plus: model might hallucinate (memorized but not retrieved)
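A back-of-the-envelope cost model makes this comparison easy to rerun with your own numbers. The token counts and price below are the illustrative assumptions from the example above, not quoted rates:

# Sketch: back-of-the-envelope monthly prompt cost. All numbers are the
# illustrative assumptions from the example above, not real pricing.
def monthly_prompt_cost(docs_per_query: int,
                        tokens_per_doc: int = 500,
                        price_per_1k_tokens: float = 0.01,
                        queries_per_day: int = 10_000,
                        days: int = 30) -> float:
    tokens_per_query = docs_per_query * tokens_per_doc
    cost_per_query = tokens_per_query / 1_000 * price_per_1k_tokens
    return cost_per_query * queries_per_day * days

long_context = monthly_prompt_cost(docs_per_query=500)  # all 500 articles, every query
rag = monthly_prompt_cost(docs_per_query=5)             # only the top-5 retrieved chunks
print(f"Long context: ${long_context:,.0f}/month")       # $750,000/month
print(f"RAG:          ${rag:,.0f}/month")                 # $7,500/month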
🎯 Conclusion: RAG as Production Foundation
Retrieval-Augmented Generation bridges the gap between LLMs' general knowledge and your specific data.
The Business Impact:
💰 Cost:
- 10-100x cheaper than long context for large doc sets
- No retraining costs (unlike fine-tuning)
- Pay only for what you retrieve
📊 Quality:
- Always uses latest information (no stale knowledge)
- Citable sources (transparency and trust)
- Controlled access (retrieves only relevant data)
⚡ Performance:
- Real-time updates (add/remove docs instantly)
- Scales to millions of documents
- Fast retrieval (<100ms typical)
Key Takeaways for Data Engineers
On RAG Architecture:
- Three stages: Indexing (offline), Retrieval (query-time), Generation (response)
- Chunking strategy affects retrieval quality
- Embeddings model choice impacts accuracy
- Action: Start with all-MiniLM-L6-v2 (384d), upgrade if needed
- ROI Impact: Proper chunking = 30-50% better retrieval accuracy
On Implementation:
- Use vector databases (ChromaDB, FAISS, Pinecone, Weaviate)
- Metadata filtering enables domain-specific retrieval
- Monitor retrieval quality (are top-K results relevant?)
- Action: Build evaluation set of 50-100 query/answer pairs
- ROI Impact: every 1% gain in retrieval accuracy yields a measurable improvement in answer quality
On When to Use RAG:
- Default choice for 10-10K documents
- Essential for frequently updated information
- Cheaper than alternatives for most use cases
- Action: Use decision framework, measure actual costs
- ROI Impact: $742K/month savings example (vs long context)
The RAG ROI Pattern
Every decision follows this pattern:
- Measure your data β How many docs? How often updated?
- Calculate costs β Long context vs RAG vs fine-tuning
- Start simple β Basic RAG with default embedding model
- Optimize iteratively β Better chunking, metadata, reranking
Real-World Example:
Legal tech company analyzing contracts:
Before RAG:
- Manual search through 50K contracts
- 30 minutes per contract to find relevant clauses
- 100 contracts/day = 50 hours/day of paralegal time
- Cost: $2,500/day labor
After RAG:
- Indexed all 50K contracts (one-time, 2 hours)
- RAG retrieves relevant clauses instantly
- Paralegals review only retrieved sections (5 min/contract)
- 100 contracts/day = 8.3 hours/day of paralegal time
- Cost: $415/day labor + $50/day API
Savings: $2,035/day = ~$509K/year (at 250 working days)
This is why RAG matters. Not as a buzzword, but as the practical foundation for connecting LLMs to real-world data at scale.
Next: We'll dive deep into chunking strategies (Article 8) - the critical factor that determines RAG quality.