
Boussaden Taha

RAG - Complete Practical Guide

Introduction

Retrieval Augmented Generation (RAG) is one of the biggest pillars in today's AI field, used mainly by large companies for better internal management and retrieval of documents.
In this article I will explain some RAG concepts with code snippets for a better grasp, talk about some common problems I faced when implementing my own RAG, and present solutions along the way.

What is RAG?

RAG (Retrieval Augmented Generation) is a system design pattern that combines:

  • Information retrieval (finding relevant knowledge)
  • Large Language Models (LLMs) (generating responses)

Instead of relying only on what the model learned during training, a RAG system retrieves external knowledge and injects it into the prompt.

Traditional LLM

Question
   ↓
Model Memory (Training Data)
   ↓
Answer

Problem:

  • knowledge can be outdated
  • hallucinations happen
  • cannot access private company data

RAG based LLM

Question
   ↓
Retrieve Relevant Knowledge
   ↓
Add Context to Prompt
   ↓
LLM Generates Grounded Answer

This makes answers:

  • more accurate
  • grounded in documents
  • customizable
  • domain-specific

Why RAG?

LLMs are powerful but limited.

Common problems:

1. Hallucinations

The model invents facts.

Example:

Question:
Who founded Company X?

Answer:
John Smith.

Even if John Smith never existed.

2. Knowledge Cutoff

Models only know what they were trained on.

They do not automatically know:

  • your PDFs
  • internal documentation
  • GitHub repositories
  • recent updates

3. Private Data

Businesses need AI over:

  • internal docs
  • policies
  • tickets
  • codebases

RAG solves this.

Core Architecture

A RAG system usually contains:

  1. Documents
  2. Chunking system
  3. Embedding model
  4. Vector database
  5. Retriever
  6. Prompt constructor
  7. LLM

Architecture:

Documents
   ↓
Chunking
   ↓
Embeddings
   ↓
Vector Database

User Question
   ↓
Question Embedding
   ↓
Similarity Search
   ↓
Relevant Chunks
   ↓
Prompt Construction
   ↓
LLM
   ↓
Answer

How RAG Works Step by Step

1. Documents

The system starts with raw documents.

Examples:

  • TXT files
  • PDFs
  • Markdown files
  • HTML pages
  • GitHub repos

Example text:

RAG systems use vector databases to retrieve
relevant information for LLMs.

2. Chunking

Documents are split into smaller sections.

Why?

Embedding an entire book as a single vector is ineffective.

Instead:

Large Document
   ↓
Small Chunks

Example:

Chunk 1 → Intro
Chunk 2 → Embeddings
Chunk 3 → Pinecone

3. Embeddings

Every chunk becomes a vector.

Example:

"RAG systems use retrieval"

becomes:

[0.12, -0.77, 0.48, ...]

4. Store in Vector Database

Vectors are stored in:

  • Pinecone
  • Weaviate
  • Qdrant
  • Chroma
  • FAISS

5. User Question

Example:

What are embeddings?

The question becomes a vector too.

6. Similarity Search

The vector database finds:

Most similar chunks

based on mathematical similarity.

7. Prompt Construction

Retrieved chunks are injected into the prompt.

Example:

Context:
Embeddings are vector representations.

Question:
What are embeddings?

8. LLM Generation

The LLM generates an answer using retrieved context.

Key Concepts and Definitions

1. Embedding

A numerical semantic representation of text.

Example:

"Machine learning"
↓
[0.12, -0.34, ...]

Purpose:

  • semantic understanding
  • similarity search

2. Vector

An ordered list of numbers.

Example:

[0.12, -0.55, 0.91]

3. Dimension

The number of values inside a vector.

Example:

768-dimensional vector

means:

768 numbers

Why it matters:

Your vector database's dimension must match your embedding model's dimension.

Example:

nomic-embed-text → 768
Pinecone index → must be 768
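A quick way to catch a mismatch early is to embed a sample text and check the vector length before creating the index. A minimal sketch using Ollama (assuming the nomic-embed-text model is pulled locally):

import ollama

# Embed a throwaway text and inspect the dimension,
# then create the Pinecone index with that exact value.
sample = ollama.embeddings(model="nomic-embed-text", prompt="dimension check")
print(len(sample["embedding"]))  # 768 for nomic-embed-text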

4. Semantic Search

Search by meaning.

Not exact keywords.

Example:

Question:

How does memory work?

Can retrieve:

Agents retain context using memory systems.

5. Similarity Score

Measures closeness between vectors.

Higher score:

More relevant

Top-K

How many results to retrieve.

Example:

top_k=5

Means:

Return best 5 chunks

6. Metadata

Extra information attached to vectors.

Example:

{
  "text": "Embeddings are vectors",
  "source": "notes.txt",
  "topic": "rag"
}

Embeddings Explained

Embeddings convert text into mathematical meaning.

Texts with similar meanings end up close together.

Example:

"How to build AI agents"

and

"Creating autonomous agents"

become nearby vectors.

Generating Embeddings with Ollama

import ollama


def generate_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )

    return response["embedding"]

Test:

embedding = generate_embedding(
    "What is RAG?"
)

print(len(embedding))
print(embedding[:10])

The code snippets above are from a RAG project I implemented; you can view the source code here.

Vector Databases

A vector database stores embeddings.

Traditional DB:

Search by exact values

Vector DB:

Search by similarity

Common vector DBs:

  • Pinecone
  • Qdrant
  • Weaviate
  • Chroma
  • FAISS

Chunking

Chunking is splitting documents.

1. Why Chunking Matters

Bad chunking = bad retrieval.

Example problem:

Chunk 1:
RAG systems use semantic

Chunk 2:
search through vectors

Meaning gets broken.

2. Character-Based Chunking

def chunk_text(text,
               chunk_size=800,
               overlap=150):

    chunks = []
    start = 0

    while start < len(text):

        end = start + chunk_size

        chunk = text[start:end]
        chunks.append(chunk)

        # Advance by less than chunk_size so consecutive
        # chunks share `overlap` characters of context
        start += chunk_size - overlap

    return chunks

3. Overlap

Preserves context.

Example:

Chunk 1 → 0-800
Chunk 2 → 650-1450

Overlap:

150 characters
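A quick usage sketch of the chunk_text function above, verifying the 150-character overlap on a hypothetical sample text:

# Build a text long enough to produce several chunks
text = "RAG systems use vector databases to retrieve relevant information. " * 40

chunks = chunk_text(text, chunk_size=800, overlap=150)

print(len(chunks))                         # number of chunks produced
print(chunks[0][650:] == chunks[1][:150])  # True: chunks share 150 characters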

Similarity Search

Pinecone compares vectors.

Usually using:

Cosine Similarity

It measures the angle between two vectors: the smaller the angle, the closer the meaning.

Similar meaning:

High cosine score
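Pinecone computes this score for you, but implementing the formula once helps demystify it. A minimal sketch in plain Python, reusing generate_embedding from the Ollama section on the two similar sentences from earlier:

import math

def cosine_similarity(a, b):
    # Dot product over the product of the magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = generate_embedding("How to build AI agents")
v2 = generate_embedding("Creating autonomous agents")

print(cosine_similarity(v1, v2))  # closer to 1 = more similar meaning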

Retrieval Pipeline

Example retrieval:

query_embedding = generate_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

Explanation:

vector=query_embedding

Search using question vector.

top_k=5

Retrieve top 5 results.

include_metadata=True

Return the stored metadata, including the original chunk text.

Prompt Augmentation

This is the "augmentation" in RAG.

We inject context.

Example:

context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)

Prompt Example

prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""

Generation Phase

Send the prompt to the LLM.
In my case, I used Mistral running locally through Ollama:

response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

print(response["message"]["content"])
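Putting it all together, here is a minimal sketch of the full question-to-answer flow, combining the snippets above (it assumes generate_embedding and the Pinecone index are already set up):

def answer_question(query, top_k=5):
    # 1. Embed the question
    query_embedding = generate_embedding(query)

    # 2. Retrieve the most similar chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # 3. Join the stored chunk texts into a context block
    context = "\n\n".join(
        match["metadata"]["text"]
        for match in results["matches"]
    )

    # 4. Generate a grounded answer
    prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]


print(answer_question("What are embeddings?"))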

Pinecone Concepts

Below are some Pinecone concepts I used that you might find helpful.

1. Index

Container of vectors.

Equivalent to:

Database table

2. Creating Index

from pinecone import Pinecone

pc = Pinecone(api_key=API_KEY)

pc.create_index(
    name="rag-demo",
    dimension=768,
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)
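After creation, you grab a handle to the index; this is the index object used in the upsert and query snippets:

# Connect to the index by name
index = pc.Index("rag-demo")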

3. Upsert

Insert/update vectors.

index.upsert(vectors=vectors)
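Each record needs an ID, the embedding values, and optional metadata. A minimal sketch building the vectors list from the chunks produced earlier (the ID scheme and source name are illustrative):

vectors = [
    {
        "id": f"chunk-{i}",                   # hypothetical unique ID per chunk
        "values": generate_embedding(chunk),  # the embedding itself
        "metadata": {
            "text": chunk,                    # keep the raw text for prompt construction
            "source": "notes.txt"
        }
    }
    for i, chunk in enumerate(chunks)
]

index.upsert(vectors=vectors)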

4. Query

Search vectors.

index.query(...)

5. Delete

Delete vectors.

index.delete(delete_all=True)

Metadata in RAG

Store useful context.

Example:

metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}

Useful later for:

  • filtering
  • citations
  • debugging
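For filtering, Pinecone accepts a metadata filter at query time. A minimal sketch restricting the search to chunks tagged with one topic (field names follow the example above):

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"topic": {"$eq": "rag"}}  # only consider chunks whose topic is "rag"
)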

Best Practices

These are some best practices to follow when building your RAG system:

  1. Retrieval quality > model quality
  2. Use metadata
  3. Keep chunks meaningful
  4. Avoid tiny chunks
  5. Re-index after document updates
  6. Use overlap
  7. Start simple before frameworks
  8. Debug retrieval separately from generation (see the sketch below)
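
For that last point, here is a minimal sketch that inspects what the retriever returns before any generation happens, so you can tell whether a bad answer comes from bad retrieval or a bad prompt:

results = index.query(
    vector=generate_embedding("What are embeddings?"),
    top_k=5,
    include_metadata=True
)

# Print each match's similarity score next to a preview of its chunk
for match in results["matches"]:
    print(f"{match['score']:.3f}  {match['metadata']['text'][:80]}")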

However, there are a few considerations: real production RAG systems often add features not present in my simple personal RAG system, such as:

  • authentication
  • streaming
  • caching
  • citations
  • reranking
  • hybrid search
  • observability
  • evaluation pipelines
  • vector versioning
  • document syncing

Glossary

RAG: Retrieval-Augmented Generation
Embedding: Numerical representation of text
Vector: Ordered list of numbers
Dimension: Number of values in a vector
Chunk: Small document section
Metadata: Extra vector information
Top-K: Number of retrieved results
Similarity Search: Finding the closest vectors
Cosine Similarity: Vector closeness metric
Index: Pinecone vector collection
Upsert: Insert/update a vector
Retrieval: Finding relevant knowledge
Generation: Producing the final answer
Hallucination: Fabricated answer
Reranking: Reordering retrieved chunks
Hybrid Search: Semantic + keyword retrieval

Conclusion

Dear reader, I hope my point of view on RAG helped you, even a little, to understand how these systems work under the hood, from embedding to retrieval to generating the proper response.
And this is the essence of a RAG system.
