📚 Tech Acronyms Reference
Quick reference for acronyms used in this article:
- API - Application Programming Interface
- BERT - Bidirectional Encoder Representations from Transformers
- FAISS - Facebook AI Similarity Search
- GPU - Graphics Processing Unit
- JSON - JavaScript Object Notation
- LLM - Large Language Model
- RAG - Retrieval-Augmented Generation
- ROI - Return on Investment
- SQL - Structured Query Language
- VRAM - Video Random Access Memory
🎯 Introduction: The Knowledge Problem
Large Language Models (LLMs) have a fundamental limitation: their knowledge is frozen at training time.
Ask GPT-4 about:
- "What did our Q3 sales look like?" β β Doesn't know your data
- "What's in our employee handbook?" β β Doesn't have your docs
- "Show me tickets from yesterday" β β No real-time access
- "What did the customer say in ticket #45632?" β β Can't see your database
The LLM has no knowledge of YOUR specific data.
Three solutions exist:
- Fine-tuning: Retrain the model on your data (expensive, slow, static)
- Long context: Put everything in the prompt (limited by context window, expensive)
- RAG: Retrieve relevant data, then generate response (flexible, scalable, cost-effective)
This article is about RAG - the most practical approach for production systems.
💡 Data Engineer's ROI Lens
For this article, we're focusing on:
- What is RAG? (Architecture and workflow)
- How do I implement it? (Complete working code)
- When should I use RAG vs alternatives? (Decision framework)
RAG is the foundation for connecting LLMs to proprietary data at scale.
🏗️ Part 1: RAG Architecture
The Three-Stage Pipeline
Real-Life Analogy: The Research Assistant
Imagine you hire a research assistant to answer questions about your company:
Stage 1 - Indexing (Preparation):
- Assistant reads all company documents
- Creates organized notes with key topics
- Files everything for quick retrieval
Stage 2 - Retrieval (Finding Relevant Info):
- You ask: "What's our return policy?"
- Assistant searches their notes
- Pulls out the 3 most relevant documents
Stage 3 - Generation (Answering):
- Assistant reads those 3 documents
- Formulates an answer based on what they found
- Responds to your question
RAG works the same way.
The RAG Workflow
┌───────────────────────────┐
│     INDEXING (Offline)    │
└───────────────────────────┘

  Documents → Chunking → Embeddings → Vector Database

  "handbook.pdf"      Split into        Create vector
  "policies.docx"  →  paragraphs    →   representations   →  Vector DB
  "faqs.md"           (chunks)          (embeddings)

                          ↓

┌───────────────────────────┐
│   RETRIEVAL (Query Time)  │
└───────────────────────────┘

  User Query → Embed Query → Search Vector DB → Top-K

  "What's the       Create vector         Find similar       Get the 5
   return       →   representation    →   chunks         →   most
   policy?"         of the question       (cosine sim)       relevant

                          ↓

┌───────────────────────────┐
│   GENERATION (Response)   │
└───────────────────────────┘

  Retrieved Docs + Query → LLM → Final Answer

  Context: [5 relevant chunks      Send to      "Our return policy
  about returns]               →   GPT-4    →    allows returns
  Question: "return policy?"                     within 30 days..."
💻 Part 2: Building Your First RAG System
Step 1: Setup and Installation
pip install langchain
pip install chromadb # Vector database
pip install sentence-transformers # Embeddings
pip install litellm # LLM interface
pip install pypdf # PDF processing
Step 2: Document Loading and Chunking
from typing import List
import re
def load_documents(file_paths: List[str]) -> List[str]:
"""Load documents from files"""
documents = []
for path in file_paths:
with open(path, 'r', encoding='utf-8') as f:
content = f.read()
documents.append(content)
return documents
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
"""
Split text into overlapping chunks.
Args:
text: Input text to chunk
chunk_size: Target size of each chunk in characters
overlap: Number of characters to overlap between chunks
"""
    # Simple sentence-aware chunking; strip stray leading/trailing whitespace first
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_length = len(sentence)
# If adding this sentence exceeds chunk_size, save current chunk
if current_size + sentence_length > chunk_size and current_chunk:
            chunks.append(' '.join(current_chunk))
# Start new chunk with overlap
# Keep last few sentences for context
overlap_sentences = []
overlap_size = 0
for s in reversed(current_chunk):
if overlap_size + len(s) < overlap:
overlap_sentences.insert(0, s)
overlap_size += len(s)
else:
break
current_chunk = overlap_sentences
current_size = overlap_size
current_chunk.append(sentence)
current_size += sentence_length
# Add final chunk
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Test chunking
sample_text = """
Our return policy allows returns within 30 days of purchase.
Items must be in original condition with tags attached.
Refunds are processed within 5-7 business days.
For exchanges, we offer free shipping on the replacement item.
Gift returns require the original gift receipt.
Sale items are final sale and cannot be returned.
"""
chunks = chunk_text(sample_text, chunk_size=120, overlap=70)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}\n")
Output:
Chunk 1: Our return policy allows returns within 30 days of purchase. Items must be in original condition with tags attached.
Chunk 2: Items must be in original condition with tags attached. Refunds are processed within 5-7 business days.
Chunk 3: Refunds are processed within 5-7 business days. For exchanges, we offer free shipping on the replacement item.
Chunk 4: For exchanges, we offer free shipping on the replacement item. Gift returns require the original gift receipt.
Chunk 5: Gift returns require the original gift receipt. Sale items are final sale and cannot be returned.
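Step 1 already installs langchain, and its text splitter is a battle-tested alternative to rolling your own chunker. A minimal sketch, assuming the classic langchain.text_splitter import path (newer releases move it to the langchain_text_splitters package):

# Sketch: the same chunking task using LangChain's splitter instead of the
# custom chunk_text() above. Import path may vary by langchain version.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,        # target chunk size in characters
    chunk_overlap=70,      # overlap between consecutive chunks
    separators=["\n\n", "\n", ". ", " "]  # prefer splitting at larger boundaries
)
lc_chunks = splitter.split_text(sample_text)
for i, chunk in enumerate(lc_chunks):
    print(f"Chunk {i+1}: {chunk}\n")

The chunks will differ slightly from the hand-rolled version, but the idea is the same: roughly fixed-size chunks with enough overlap to preserve context across boundaries.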
Step 3: Create Embeddings and Vector Database
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
class VectorStore:
"""Simple vector database wrapper"""
def __init__(self, collection_name: str = "documents"):
# Initialize embedding model
self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Initialize ChromaDB
self.client = chromadb.Client(Settings(
anonymized_telemetry=False
))
        # Create the collection if it doesn't already exist, otherwise reuse it
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )
def add_documents(self, texts: List[str], metadata: List[dict] = None):
"""Add documents to vector store"""
# Generate embeddings
embeddings = self.embedding_model.encode(texts).tolist()
        # Generate IDs, offset by the current count so repeated calls don't collide
        start = self.collection.count()
        ids = [f"doc_{start + i}" for i in range(len(texts))]
# Add to collection
self.collection.add(
embeddings=embeddings,
documents=texts,
ids=ids,
metadatas=metadata if metadata else [{}] * len(texts)
)
print(f"Added {len(texts)} documents to vector store")
def search(self, query: str, top_k: int = 5) -> List[dict]:
"""Search for similar documents"""
# Embed query
query_embedding = self.embedding_model.encode([query]).tolist()
# Search
results = self.collection.query(
query_embeddings=query_embedding,
n_results=top_k
)
# Format results
documents = []
for i in range(len(results['documents'][0])):
documents.append({
'text': results['documents'][0][i],
'distance': results['distances'][0][i],
'metadata': results['metadatas'][0][i]
})
return documents
# Example usage
vector_store = VectorStore(collection_name="company_docs")
# Sample company documents
documents = [
"Our return policy allows returns within 30 days of purchase with original receipt.",
"Shipping is free for orders over $50. Standard shipping takes 3-5 business days.",
"We offer 24/7 customer support via phone, email, and live chat.",
"All products come with a 1-year manufacturer warranty covering defects.",
"International shipping is available to over 100 countries worldwide.",
"Our price match guarantee ensures you get the best deal within 14 days of purchase."
]
# Add documents
vector_store.add_documents(documents)
# Search
query = "How long do I have to return something?"
results = vector_store.search(query, top_k=3)
print(f"\nQuery: {query}\n")
for i, result in enumerate(results):
print(f"Result {i+1} (distance: {result['distance']:.3f}):")
print(f"{result['text']}\n")
Output:
Added 6 documents to vector store
Query: How long do I have to return something?
Result 1 (distance: 0.312):
Our return policy allows returns within 30 days of purchase with original receipt.
Result 2 (distance: 0.689):
Our price match guarantee ensures you get the best deal within 14 days of purchase.
Result 3 (distance: 0.724):
All products come with a 1-year manufacturer warranty covering defects.
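One caveat: chromadb.Client() keeps everything in memory, so the index disappears when the process exits. For anything beyond a quick experiment you probably want on-disk persistence. A minimal sketch, assuming ChromaDB's PersistentClient API (the path is illustrative):

# Sketch: persist the index to disk so it survives restarts.
# Assumes chromadb >= 0.4 with the PersistentClient API; path is illustrative.
import chromadb

persistent_client = chromadb.PersistentClient(path="./chroma_db")
persistent_collection = persistent_client.get_or_create_collection(
    name="company_docs",
    metadata={"hnsw:space": "cosine"}
)

Swapping this client into the VectorStore constructor means the indexing stage only has to run when documents actually change.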
Step 4: Complete RAG Pipeline
from litellm import completion
class RAGSystem:
"""Complete RAG system"""
def __init__(self, vector_store: VectorStore, model: str = "gpt-4"):
self.vector_store = vector_store
self.model = model
def query(
self,
question: str,
top_k: int = 5,
temperature: float = 0.0
) -> dict:
"""
Query the RAG system.
Returns:
dict with 'answer', 'sources', and 'retrieved_docs'
"""
# Step 1: Retrieve relevant documents
retrieved_docs = self.vector_store.search(question, top_k=top_k)
# Step 2: Build context from retrieved documents
context = "\n\n".join([
f"[Document {i+1}]\n{doc['text']}"
for i, doc in enumerate(retrieved_docs)
])
# Step 3: Create prompt with context
prompt = f"""Answer the question based on the context below. If the answer is not in the context, say "I don't have enough information to answer that."
Context:
{context}
Question: {question}
Answer:"""
# Step 4: Generate response
response = completion(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature
)
answer = response.choices[0].message.content
return {
'answer': answer,
'sources': [doc['text'] for doc in retrieved_docs],
'retrieved_docs': retrieved_docs,
'context': context
}
def query_with_citation(self, question: str, top_k: int = 5) -> str:
"""Query with inline citations"""
result = self.query(question, top_k=top_k)
# Build answer with citations
answer = result['answer']
sources = result['sources']
response = f"{answer}\n\nSources:\n"
for i, source in enumerate(sources[:3]): # Show top 3 sources
response += f"{i+1}. {source}\n"
return response
# Create RAG system
rag = RAGSystem(vector_store, model="gpt-4")
# Test queries
queries = [
"What's your return policy?",
"Do you offer international shipping?",
"How can I contact customer support?",
"What about warranties?"
]
for query in queries:
print(f"{'='*60}")
print(f"Q: {query}")
print(f"{'='*60}")
result = rag.query(query, top_k=3)
print(f"A: {result['answer']}")
print(f"\nRetrieved {len(result['retrieved_docs'])} relevant documents\n")
Output:
============================================================
Q: What's your return policy?
============================================================
A: Our return policy allows returns within 30 days of purchase, provided you have the original receipt.
Retrieved 3 relevant documents
============================================================
Q: Do you offer international shipping?
============================================================
A: Yes, international shipping is available to over 100 countries worldwide.
Retrieved 3 relevant documents
============================================================
Q: How can I contact customer support?
============================================================
A: We offer 24/7 customer support through phone, email, and live chat.
Retrieved 3 relevant documents
============================================================
Q: What about warranties?
============================================================
A: All products come with a 1-year manufacturer warranty that covers defects.
Retrieved 3 relevant documents
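Before adding features, it's worth checking that retrieval is actually surfacing the right chunks, since the LLM can only be as good as its context. A minimal hit-rate check over a small hand-labelled set (the queries and expected phrases below are illustrative; in practice build 50-100 pairs):

# Sketch: measure retrieval hit rate on a small labelled evaluation set.
# Each entry pairs a query with a phrase the top-k results should contain.
eval_set = [
    ("How long do I have to return something?", "within 30 days"),
    ("Is shipping free?", "free for orders over $50"),
    ("Do you ship abroad?", "international shipping"),
]

hits = 0
for question, expected_phrase in eval_set:
    retrieved = vector_store.search(question, top_k=3)
    if any(expected_phrase.lower() in doc['text'].lower() for doc in retrieved):
        hits += 1

print(f"Retrieval hit rate: {hits}/{len(eval_set)} = {hits / len(eval_set):.0%}")

If the hit rate is low, fix retrieval first (chunking, embedding model, top-k) before touching the prompt or the LLM.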
Step 5: Advanced RAG with Metadata Filtering
class AdvancedRAGSystem(RAGSystem):
"""RAG with metadata filtering"""
def query_with_filters(
self,
question: str,
filters: dict = None,
top_k: int = 5
) -> dict:
"""
Query with metadata filters.
filters example: {"category": "returns", "department": "sales"}
"""
# Search with filters (ChromaDB syntax)
query_embedding = self.vector_store.embedding_model.encode([question]).tolist()
where_clause = filters if filters else None
results = self.vector_store.collection.query(
query_embeddings=query_embedding,
n_results=top_k,
where=where_clause
)
# Format retrieved docs
retrieved_docs = []
for i in range(len(results['documents'][0])):
retrieved_docs.append({
'text': results['documents'][0][i],
'distance': results['distances'][0][i],
'metadata': results['metadatas'][0][i]
})
# Build context
context = "\n\n".join([
f"[Document {i+1}]\n{doc['text']}"
for i, doc in enumerate(retrieved_docs)
])
# Generate
prompt = f"""Answer based on the context. If not in context, say you don't know.
Context:
{context}
Question: {question}
Answer:"""
response = completion(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0.0
)
return {
'answer': response.choices[0].message.content,
'sources': [doc['text'] for doc in retrieved_docs],
'retrieved_docs': retrieved_docs
}
# Add documents with metadata
documents_with_metadata = [
("Returns are accepted within 30 days with receipt.", {"category": "returns", "department": "sales"}),
("Exchanges are free within 60 days of purchase.", {"category": "returns", "department": "sales"}),
("Technical support available 24/7 via phone.", {"category": "support", "department": "technical"}),
("Shipping is free over $50 within USA.", {"category": "shipping", "department": "logistics"}),
]
# Create new vector store with metadata
vector_store_meta = VectorStore(collection_name="docs_with_metadata")
texts = [doc[0] for doc in documents_with_metadata]
metadata = [doc[1] for doc in documents_with_metadata]
vector_store_meta.add_documents(texts, metadata)
# Query with filters
advanced_rag = AdvancedRAGSystem(vector_store_meta)
result = advanced_rag.query_with_filters(
question="What's the policy on returns?",
filters={"category": "returns"}, # Only search returns category
top_k=3
)
print(f"Answer: {result['answer']}")
print(f"\nSources found (filtered to 'returns' category only):")
for source in result['sources']:
print(f"- {source}")
Output:
Answer: Returns are accepted within 30 days with the original receipt. Additionally, exchanges are free within 60 days of purchase.
Sources found (filtered to 'returns' category only):
- Returns are accepted within 30 days with receipt.
- Exchanges are free within 60 days of purchase.
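ChromaDB's where clause also supports combining conditions. A hedged sketch using the $and operator (verify the filter syntax against the ChromaDB version you run):

# Sketch: combine multiple metadata conditions with ChromaDB's $and operator.
# Verify the operator syntax against your ChromaDB version.
result = advanced_rag.query_with_filters(
    question="When can I exchange an item?",
    filters={"$and": [
        {"category": "returns"},
        {"department": "sales"}
    ]},
    top_k=2
)
print(f"Answer: {result['answer']}")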
⚖️ Part 3: RAG vs Alternatives
Decision Framework
START
  │
  ├─ Do you need the model to "know" new information?
  │    ├─ NO  → Use base LLM (no RAG needed)
  │    └─ YES → Continue
  │
  ├─ Does the information change frequently?
  │    ├─ YES   → RAG (dynamic, real-time updates)
  │    └─ MAYBE → Continue
  │
  ├─ Is the information private/proprietary?
  │    ├─ YES → RAG or Fine-tuning (don't put in training data)
  │    └─ NO  → Continue
  │
  ├─ How much data?
  │    ├─ <10 docs → Long context (put in prompt)
  │    ├─ 10-10,000 docs → RAG
  │    └─ >10,000 docs or specialized domain → RAG + possible fine-tuning
  │
  ├─ Do you need the model to change its behavior/style?
  │    ├─ YES → Fine-tuning
  │    └─ NO  → RAG
  │
  └─ END
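The same framework, expressed as a small helper function you can drop into a design doc or notebook. The thresholds mirror the flowchart above and are rough heuristics, not hard rules:

# Sketch: the decision flowchart as a function. Thresholds are the rough
# heuristics from the flowchart, not hard rules.
def choose_approach(needs_new_info: bool,
                    changes_frequently: bool,
                    num_docs: int,
                    needs_behavior_change: bool) -> str:
    if not needs_new_info:
        return "Base LLM"                       # model already knows enough
    if needs_behavior_change:
        return "Fine-tuning (possibly + RAG)"   # style/behavior needs training
    if num_docs < 10 and not changes_frequently:
        return "Long context"                   # just put it in the prompt
    if num_docs > 10_000:
        return "RAG + possible fine-tuning"     # large or specialized corpus
    return "RAG"

print(choose_approach(True, True, 500, False))   # -> RAG
print(choose_approach(True, False, 3, False))    # -> Long context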
Comparison Table
| Approach | Best For | Cost | Update Speed | Complexity |
|---|---|---|---|---|
| Base LLM | General knowledge already in model | $ | N/A | Low |
| Long Context | <10 documents, static info | $$ | Instant | Low |
| RAG | 10-10K docs, frequently updated | $$$ | Real-time | Medium |
| Fine-tuning | Specialized domain, behavior changes | $$$$ | Slow (retrain) | High |
| RAG + Fine-tuning | Large-scale specialized systems | $$$$$ | Mixed | High |
When to Use RAG
✅ Use RAG when:
- Information updates frequently (daily/weekly)
- You have 10+ documents but <1M documents
- Need to cite sources (show where answer came from)
- Data is proprietary (can't put in training data)
- Want to control what model can access
- Need real-time information
- Budget-conscious (cheaper than fine-tuning)
❌ Don't use RAG when:
- Information fits in one prompt (use long context)
- Need to change model behavior/style (use fine-tuning)
- Need sub-millisecond response (caching might help)
- Only have 1-2 documents (just put in prompt)
Real-World Example: Customer Support System
Scenario: E-commerce company, 500 help articles, updated weekly
Option 1: Long Context
- Put all 500 articles in every prompt
- Cost: 500 articles × 500 tokens = 250K tokens per query
- At $0.01/1K tokens: $2.50 per query
- 10K queries/day = $25,000/day = $750K/month ❌
Option 2: RAG
- Index 500 articles once
- Retrieve top 5 relevant articles per query
- Cost: 5 articles × 500 tokens = 2.5K tokens per query
- At $0.01/1K tokens: $0.025 per query
- 10K queries/day = $250/day = $7,500/month ✅
RAG is 100x cheaper for this use case.
Option 3: Fine-tuning
- Train model on all 500 articles
- Cost: $1,000-5,000 initial training
- Must retrain weekly (articles update)
- Annual cost: $50K-250K ❌
- Plus: model might hallucinate (memorized but not retrieved)
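A back-of-the-envelope cost model makes this comparison easy to rerun with your own numbers. The token counts and price below are the illustrative assumptions from the example above, not quoted rates:

# Sketch: back-of-the-envelope monthly prompt cost. All numbers are the
# illustrative assumptions from the example above, not real pricing.
def monthly_prompt_cost(docs_per_query: int,
                        tokens_per_doc: int = 500,
                        price_per_1k_tokens: float = 0.01,
                        queries_per_day: int = 10_000,
                        days: int = 30) -> float:
    tokens_per_query = docs_per_query * tokens_per_doc
    cost_per_query = tokens_per_query / 1_000 * price_per_1k_tokens
    return cost_per_query * queries_per_day * days

long_context = monthly_prompt_cost(docs_per_query=500)  # all 500 articles, every query
rag = monthly_prompt_cost(docs_per_query=5)             # only the top-5 retrieved chunks
print(f"Long context: ${long_context:,.0f}/month")       # $750,000/month
print(f"RAG:          ${rag:,.0f}/month")                 # $7,500/month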
🎯 Conclusion: RAG as Production Foundation
Retrieval-Augmented Generation bridges the gap between LLMs' general knowledge and your specific data.
The Business Impact:
💰 Cost:
- 10-100x cheaper than long context for large doc sets
- No retraining costs (unlike fine-tuning)
- Pay only for what you retrieve
📊 Quality:
- Always uses latest information (no stale knowledge)
- Citable sources (transparency and trust)
- Controlled access (retrieves only relevant data)
⚡ Performance:
- Real-time updates (add/remove docs instantly)
- Scales to millions of documents
- Fast retrieval (<100ms typical)
Key Takeaways for Data Engineers
On RAG Architecture:
- Three stages: Indexing (offline), Retrieval (query-time), Generation (response)
- Chunking strategy affects retrieval quality
- Embeddings model choice impacts accuracy
- Action: Start with all-MiniLM-L6-v2 (384d), upgrade if needed
- ROI Impact: Proper chunking = 30-50% better retrieval accuracy
On Implementation:
- Use vector databases (ChromaDB, FAISS, Pinecone, Weaviate)
- Metadata filtering enables domain-specific retrieval
- Monitor retrieval quality (are top-K results relevant?)
- Action: Build evaluation set of 50-100 query/answer pairs
- ROI Impact: every 1% gain in retrieval accuracy yields a measurable improvement in answer quality
On When to Use RAG:
- Default choice for 10-10K documents
- Essential for frequently updated information
- Cheaper than alternatives for most use cases
- Action: Use decision framework, measure actual costs
- ROI Impact: $742K/month savings example (vs long context)
The RAG ROI Pattern
Every decision follows this pattern:
- Measure your data β How many docs? How often updated?
- Calculate costs β Long context vs RAG vs fine-tuning
- Start simple β Basic RAG with default embedding model
- Optimize iteratively β Better chunking, metadata, reranking
Real-World Example:
Legal tech company analyzing contracts:
Before RAG:
- Manual search through 50K contracts
- 30 minutes per contract to find relevant clauses
- 100 contracts/day = 50 hours/day of paralegal time
- Cost: $2,500/day labor
After RAG:
- Indexed all 50K contracts (one-time, 2 hours)
- RAG retrieves relevant clauses instantly
- Paralegals review only retrieved sections (5 min/contract)
- 100 contracts/day = 8.3 hours/day of paralegal time
- Cost: $415/day labor + $50/day API
Savings: $2,035/day = ~$509K/year (at 250 working days)
This is why RAG matters. Not as a buzzword, but as the practical foundation for connecting LLMs to real-world data at scale.
Next: We'll dive deep into chunking strategies (Article 8) - the critical factor that determines RAG quality.