
Boussaden Taha

RAG - Complete Practical Guide

Introduction

Retrieval Augmented Generation (RAG) is one of the biggest pillars in today's AI field, used mainly by large companies for better internal management and retrieval of documents.
In this article I will explain some RAG concepts with code snippets for a better grasp, talk about some common problems I faced when implementing my own RAG, and present solutions along the way.

What is RAG?

RAG (Retrieval Augmented Generation) is a system design pattern that combines:

  • Information retrieval (finding relevant knowledge)
  • Large Language Models (LLMs) (generating responses)

Instead of relying only on what the model learned during training, a RAG system retrieves external knowledge and injects it into the prompt.

Traditional LLM

Question
   ↓
Model Memory (Training Data)
   ↓
Answer

Problem:

  • knowledge can be outdated
  • hallucinations happen
  • cannot access private company data

RAG based LLM

Question
   ↓
Retrieve Relevant Knowledge
   ↓
Add Context to Prompt
   ↓
LLM Generates Grounded Answer

This makes answers:

  • more accurate
  • grounded in documents
  • customizable
  • domain-specific

Why RAG?

LLMs are powerful but limited.

Common problems:

1. Hallucinations

The model invents facts.

Example:

Question:
Who founded Company X?

Answer:
John Smith.

Even if John Smith never existed.

2. Knowledge Cutoff

Models only know what they were trained on.

They do not automatically know:

  • your PDFs
  • internal documentation
  • GitHub repositories
  • recent updates

3. Private Data

Businesses need AI over:

  • internal docs
  • policies
  • tickets
  • codebases

RAG solves this.

Core Architecture

A RAG system usually contains:

  1. Documents
  2. Chunking system
  3. Embedding model
  4. Vector database
  5. Retriever
  6. Prompt constructor
  7. LLM

Architecture:

Documents
   ↓
Chunking
   ↓
Embeddings
   ↓
Vector Database

User Question
   ↓
Question Embedding
   ↓
Similarity Search
   ↓
Relevant Chunks
   ↓
Prompt Construction
   ↓
LLM
   ↓
Answer

How RAG Works Step by Step

1. Documents

The system starts with raw documents.

Examples:

  • TXT files
  • PDFs
  • Markdown files
  • HTML pages
  • GitHub repos

Example text:

RAG systems use vector databases to retrieve
relevant information for LLMs.

2. Chunking

Documents are split into smaller sections.

Why?

Embedding an entire book as a single vector is ineffective.

Instead:

Large Document
   ↓
Small Chunks

Example:

Chunk 1 → Intro
Chunk 2 → Embeddings
Chunk 3 → Pinecone

3. Embeddings

Every chunk becomes a vector.

Example:

"RAG systems use retrieval"

becomes:

[0.12, -0.77, 0.48, ...]

4. Store in Vector Database

Vectors are stored in:

  • Pinecone
  • Weaviate
  • Qdrant
  • Chroma
  • FAISS

5. User Question

Example:

What are embeddings?

The question becomes a vector too.

6. Similarity Search

The vector database finds:

Most similar chunks

based on mathematical similarity.

7. Prompt Construction

Retrieved chunks are injected into the prompt.

Example:

Context:
Embeddings are vector representations.

Question:
What are embeddings?

8. LLM Generation

The LLM generates an answer using retrieved context.

Key Concepts and Definitions

1. Embedding

A numerical semantic representation of text.

Example:

"Machine learning"
↓
[0.12, -0.34, ...]

Purpose:

  • semantic understanding
  • similarity search

2. Vector

An ordered list of numbers.

Example:

[0.12, -0.55, 0.91]

3. Dimension

The number of values inside a vector.

Example:

768-dimensional vector

means:

768 numbers

Why it matters:

Your vector database's dimension must match your embedding model's dimension.

Example:

nomic-embed-text → 768
Pinecone index → must be 768
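A quick way to catch a mismatch early is to embed a sample text and check the vector length before creating the index. A minimal sketch using Ollama (assuming the nomic-embed-text model is pulled locally):

import ollama

# Embed a throwaway text and inspect the dimension,
# then create the Pinecone index with that exact value.
sample = ollama.embeddings(model="nomic-embed-text", prompt="dimension check")
print(len(sample["embedding"]))  # 768 for nomic-embed-text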

4. Semantic Search

Search by meaning.

Not exact keywords.

Example:

Question:

How does memory work?

Can retrieve:

Agents retain context using memory systems.

5. Similarity Score

Measures closeness between vectors.

Higher score:

More relevant

Top-K

How many results to retrieve.

Example:

top_k=5

Means:

Return best 5 chunks

6. Metadata

Extra information attached to vectors.

Example:

{
  "text": "Embeddings are vectors",
  "source": "notes.txt",
  "topic": "rag"
}

Embeddings Explained

Embeddings convert text into mathematical meaning.

Texts with similar meanings end up close together.

Example:

"How to build AI agents"

and

"Creating autonomous agents"

become nearby vectors.

Generating Embeddings with Ollama

import ollama


def generate_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )

    return response["embedding"]

Test:

embedding = generate_embedding(
    "What is RAG?"
)

print(len(embedding))
print(embedding[:10])

The code snippets above are from a RAG project I implemented; you can view the source code here.

Vector Databases

A vector database stores embeddings.

Traditional DB:

Search by exact values

Vector DB:

Search by similarity

Common vector DBs:

  • Pinecone
  • Qdrant
  • Weaviate
  • Chroma
  • FAISS

Chunking

Chunking is splitting documents.

1. Why Chunking Matters

Bad chunking = bad retrieval.

Example problem:

Chunk 1:
RAG systems use semantic

Chunk 2:
search through vectors

Meaning gets broken.

2. Character-Based Chunking

def chunk_text(text,
               chunk_size=800,
               overlap=150):

    chunks = []
    start = 0

    while start < len(text):

        end = start + chunk_size

        chunk = text[start:end]
        chunks.append(chunk)

        # Advance by less than chunk_size so consecutive
        # chunks share `overlap` characters of context
        start += chunk_size - overlap

    return chunks

3. Overlap

Preserves context.

Example:

Chunk 1 → 0-800
Chunk 2 → 650-1450

Overlap:

150 characters
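A quick usage sketch of the chunk_text function above, verifying the 150-character overlap on a hypothetical sample text:

# Build a text long enough to produce several chunks
text = "RAG systems use vector databases to retrieve relevant information. " * 40

chunks = chunk_text(text, chunk_size=800, overlap=150)

print(len(chunks))                         # number of chunks produced
print(chunks[0][650:] == chunks[1][:150])  # True: chunks share 150 characters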

Similarity Search

Pinecone compares vectors.

Usually using:

Cosine Similarity

It measures the angle between two vectors: the smaller the angle, the closer the meaning.

Similar meaning:

High cosine score
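Pinecone computes this score for you, but implementing the formula once helps demystify it. A minimal sketch in plain Python, reusing generate_embedding from the Ollama section on the two similar sentences from earlier:

import math

def cosine_similarity(a, b):
    # Dot product over the product of the magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = generate_embedding("How to build AI agents")
v2 = generate_embedding("Creating autonomous agents")

print(cosine_similarity(v1, v2))  # closer to 1 = more similar meaning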

Retrieval Pipeline

Example retrieval:

query_embedding = generate_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

Explanation:

vector=query_embedding

Search using question vector.

top_k=5

Retrieve top 5 results.

include_metadata=True

Return the stored metadata, including the original chunk text.

Prompt Augmentation

This is the "augmentation" in RAG.

We inject context.

Example:

context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)

Prompt Example

prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""

Generation Phase

Send the prompt to the LLM.
In my case, I used Mistral running locally through Ollama:

response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

print(response["message"]["content"])
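Putting it all together, here is a minimal sketch of the full question-to-answer flow, combining the snippets above (it assumes generate_embedding and the Pinecone index are already set up):

def answer_question(query, top_k=5):
    # 1. Embed the question
    query_embedding = generate_embedding(query)

    # 2. Retrieve the most similar chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # 3. Join the stored chunk texts into a context block
    context = "\n\n".join(
        match["metadata"]["text"]
        for match in results["matches"]
    )

    # 4. Generate a grounded answer
    prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]


print(answer_question("What are embeddings?"))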

Pinecone Concepts

Below are some Pinecone concepts I used that you might find helpful.

1. Index

Container of vectors.

Equivalent to:

Database table

2. Creating Index

from pinecone import Pinecone

pc = Pinecone(api_key=API_KEY)

pc.create_index(
    name="rag-demo",
    dimension=768,
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)
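After creation, you grab a handle to the index; this is the index object used in the upsert and query snippets:

# Connect to the index by name
index = pc.Index("rag-demo")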

3. Upsert

Insert/update vectors.

index.upsert(vectors=vectors)
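Each record needs an ID, the embedding values, and optional metadata. A minimal sketch building the vectors list from the chunks produced earlier (the ID scheme and source name are illustrative):

vectors = [
    {
        "id": f"chunk-{i}",                   # hypothetical unique ID per chunk
        "values": generate_embedding(chunk),  # the embedding itself
        "metadata": {
            "text": chunk,                    # keep the raw text for prompt construction
            "source": "notes.txt"
        }
    }
    for i, chunk in enumerate(chunks)
]

index.upsert(vectors=vectors)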

4. Query

Search vectors.

index.query(...)

5. Delete

Delete vectors.

index.delete(delete_all=True)

Metadata in RAG

Store useful context.

Example:

metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}

Useful later for:

  • filtering
  • citations
  • debugging
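For filtering, Pinecone accepts a metadata filter at query time. A minimal sketch restricting the search to chunks tagged with one topic (field names follow the example above):

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"topic": {"$eq": "rag"}}  # only consider chunks whose topic is "rag"
)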

Best Practices

These are some best practices to follow when building your RAG system:

  1. Retrieval quality > model quality
  2. Use metadata
  3. Keep chunks meaningful
  4. Avoid tiny chunks
  5. Re-index after document updates
  6. Use overlap
  7. Start simple before frameworks
  8. Debug retrieval separately from generation (see the sketch below)
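
For that last point, here is a minimal sketch that inspects what the retriever returns before any generation happens, so you can tell whether a bad answer comes from bad retrieval or a bad prompt:

results = index.query(
    vector=generate_embedding("What are embeddings?"),
    top_k=5,
    include_metadata=True
)

# Print each match's similarity score next to a preview of its chunk
for match in results["matches"]:
    print(f"{match['score']:.3f}  {match['metadata']['text'][:80]}")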

However, there are a few considerations: real production RAG systems often add features not present in my simple personal RAG system, such as:

  • authentication
  • streaming
  • caching
  • citations
  • reranking
  • hybrid search
  • observability
  • evaluation pipelines
  • vector versioning
  • document syncing

Glossary

RAG: Retrieval-Augmented Generation
Embedding: Numerical representation of text
Vector: Ordered list of numbers
Dimension: Number of values in a vector
Chunk: Small document section
Metadata: Extra vector information
Top-K: Number of retrieved results
Similarity Search: Finding the closest vectors
Cosine Similarity: Vector closeness metric
Index: Pinecone vector collection
Upsert: Insert/update a vector
Retrieval: Finding relevant knowledge
Generation: Producing the final answer
Hallucination: Fabricated answer
Reranking: Reordering retrieved chunks
Hybrid Search: Semantic + keyword retrieval

Conclusion

Dear reader, I hope my point of view on RAG helped you, even a little, to understand how these systems work under the hood, from embedding to retrieval to generating the proper response.
And this is the essence of a RAG system.
