Introduction
Retrieval Augmented Generation (RAG) is one of the biggest pillars in today's AI field, mainly used by big companies for better internal management and retrieval of documents.
In this article I will explain some RAG concepts with code snippets for a better grasp, talk about some common problems I faced when implementing my own RAG, and present some solutions along the way.
What is RAG?
RAG (Retrieval Augmented Generation) is a system design pattern that combines:
- Information retrieval (finding relevant knowledge)
- Large Language Models (LLMs) (generating responses)
Instead of relying only on what the model learned during training, a RAG system retrieves external knowledge and injects it into the prompt.
Traditional LLM
Question
↓
Model Memory (Training Data)
↓
Answer
Problems:
- knowledge can be outdated
- hallucinations happen
- cannot access private company data
RAG-based LLM
Question
↓
Retrieve Relevant Knowledge
↓
Add Context to Prompt
↓
LLM Generates Grounded Answer
This makes answers:
- more accurate
- grounded in documents
- customizable
- domain-specific
Why RAG?
LLMs are powerful but limited.
Common problems:
1. Hallucinations
The model invents facts.
Example:
Question:
Who founded Company X?
Answer:
John Smith.
Even if John Smith never existed.
2. Knowledge Cutoff
Models only know what they were trained on.
They do not automatically know:
- your PDFs
- internal documentation
- GitHub repositories
- recent updates
3. Private Data
Businesses need AI over:
- internal docs
- policies
- tickets
- codebases
RAG solves this.
Core Architecture
A RAG system usually contains:
- Documents
- Chunking system
- Embedding model
- Vector database
- Retriever
- Prompt constructor
- LLM
Architecture:

Indexing pipeline:
Documents
↓
Chunking
↓
Embeddings
↓
Vector Database

Query pipeline:
User Question
↓
Question Embedding
↓
Similarity Search
↓
Relevant Chunks
↓
Prompt Construction
↓
LLM
↓
Answer
How RAG Works Step by Step
1. Documents
The system starts with raw documents.
Examples:
- TXT files
- PDFs
- Markdown files
- HTML pages
- GitHub repos
Example text:
RAG systems use vector databases to retrieve
relevant information for LLMs.
2. Chunking
Documents are split into smaller sections.
Why?
Embedding an entire book as a single vector is ineffective: the meaning gets diluted and retrieval becomes imprecise.
Instead:
Large Document
↓
Small Chunks
Example:
Chunk 1 → Intro
Chunk 2 → Embeddings
Chunk 3 → Pinecone
3. Embeddings
Every chunk becomes a vector.
Example:
"RAG systems use retrieval"
becomes:
[0.12, -0.77, 0.48, ...]
4. Store in Vector Database
Vectors are stored in:
- Pinecone
- Weaviate
- Qdrant
- Chroma
- FAISS
5. User Question
Example:
What are embeddings?
Question becomes a vector too.
6. Similarity Search
The vector database finds:
Most similar chunks
based on mathematical similarity.
7. Prompt Construction
Retrieved chunks are injected into the prompt.
Example:
Context:
Embeddings are vector representations.
Question:
What are embeddings?
8. LLM Generation
The LLM generates an answer using retrieved context.
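Each of these steps is shown in more detail, with code, in the sections below.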
Key Concepts and Definitions
1. Embedding
A numerical semantic representation of text.
Example:
"Machine learning"
↓
[0.12, -0.34, ...]
Purpose:
- semantic understanding
- similarity search
2. Vector
An ordered list of numbers.
Example:
[0.12, -0.55, 0.91]
3. Dimension
The number of values inside a vector.
Example:
768-dimensional vector
means:
768 numbers
Why it matters:
Your vector DB index dimension must match your embedding model's dimension.
Example:
nomic-embed-text → 768
Pinecone index → must be 768
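A quick sanity check, assuming Ollama is running locally with nomic-embed-text pulled:

import ollama

# Embed a sample string and check the vector length
emb = ollama.embeddings(
    model="nomic-embed-text",
    prompt="dimension check"
)["embedding"]

print(len(emb))  # 768 — create the Pinecone index with this exact dimension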
4. Semantic Search
Search by meaning.
Not exact keywords.
Example:
Question:
How does memory work?
Can retrieve:
Agents retain context using memory systems.
5. Similarity Score
Measures closeness between vectors.
Higher score:
More relevant
6. Top-K
How many results to retrieve.
Example:
top_k=5
Means:
Return best 5 chunks
7. Metadata
Extra information attached to vectors.
Example:
{
    "text": "Embeddings are vectors",
    "source": "notes.txt",
    "topic": "rag"
}
Embeddings Explained
Embeddings convert text into mathematical meaning.
Texts with similar meanings end up close together.
Example:
"How to build AI agents"
and
"Creating autonomous agents"
become nearby vectors.
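A minimal sketch of this idea, assuming Ollama with nomic-embed-text runs locally (embed and cosine_similarity are helpers I define here just for illustration):

import ollama
import numpy as np

def embed(text):
    # Get an embedding vector from the local Ollama model
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embed("How to build AI agents")
v2 = embed("Creating autonomous agents")
v3 = embed("Best pasta recipes")

print(cosine_similarity(v1, v2))  # relatively high — similar meaning
print(cosine_similarity(v1, v3))  # noticeably lower — unrelated topic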
Generating Embeddings with Ollama
import ollama
def generate_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )
    return response["embedding"]
Test:
embedding = generate_embedding("What is RAG?")
print(len(embedding))
print(embedding[:10])
The code snippets above are from a RAG project I implemented; you can view the source code here.
Vector Databases
A vector database stores embeddings.
Traditional DB:
Search by exact values
Vector DB:
Search by similarity
Common vector DBs:
- Pinecone
- Qdrant
- Weaviate
- Chroma
- FAISS
Chunking
Chunking is splitting documents.
1. Why Chunking Matters
Bad chunking = bad retrieval.
Example problem:
Chunk 1:
RAG systems use semantic
Chunk 2:
search through vectors
Meaning gets broken.
2. Character-Based Chunking
def chunk_text(text, chunk_size=800, overlap=150):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        # Step forward by less than the chunk size so that
        # consecutive chunks share `overlap` characters of context
        start += chunk_size - overlap
    return chunks
3. Overlap
Preserves context.
Example:
Chunk 1 → 0-800
Chunk 2 → 650-1450
Overlap:
150 characters
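A quick usage sketch with the chunk_text function above (the sample text is synthetic, just to make the overlap visible):

# Roughly 2,000 characters of sample text
text = "RAG systems use vector databases. " * 60

chunks = chunk_text(text)
print(len(chunks))                          # 4 chunks
print(chunks[0][-150:] == chunks[1][:150])  # True — the 150-character overlap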
Similarity Search
Pinecone compares vectors.
Usually using:
Cosine Similarity
Measures angle similarity.
Similar meaning:
High cosine score
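For two vectors A and B, the score is the dot product divided by the product of their lengths: cos_sim(A, B) = (A · B) / (|A| × |B|). A score close to 1 means the vectors point in almost the same direction, i.e. the texts mean almost the same thing.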
Retrieval Pipeline
Example retrieval:
query_embedding = generate_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
Explanation:
vector=query_embedding
Search using question vector.
top_k=5
Retrieve top 5 results.
include_metadata=True
Return original chunk text.
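A small sketch of inspecting what came back; the field names match the Pinecone query response shape used throughout this article:

for match in results["matches"]:
    # Each match carries an id, a similarity score, and the stored metadata
    print(match["score"], match["metadata"]["text"][:60])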
Prompt Augmentation
This is the "augmentation" in RAG.
We inject context.
Example:
context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)
Prompt Example
prompt = f"""
You are a helpful assistant.
Answer ONLY using the context.
Context:
{context}
Question:
{query}
Answer:
"""
Generation Phase
Send prompt to the LLM.
In my case, I used Mistral running locally through Ollama:
response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

print(response["message"]["content"])
Pinecone Concepts
Below are some Pinecone concepts I used and hope you'll find helpful.
1. Index
Container of vectors.
Equivalent to:
Database table
2. Creating Index
from pinecone import Pinecone

pc = Pinecone(api_key=API_KEY)

pc.create_index(
    name="rag-demo",
    dimension=768,
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)
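Recent versions of the Pinecone SDK also provide a ServerlessSpec helper class that can be passed as spec instead of a raw dictionary.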
3. Upsert
Insert/update vectors.
index.upsert(vectors=vectors)
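Each vector needs a unique id, the embedding values, and optionally metadata. A minimal sketch (the id and chunk text here are just examples, and generate_embedding is the helper defined earlier):

chunk = "Embeddings are vector representations."

vectors = [
    {
        "id": "chunk-0",                      # any unique string id
        "values": generate_embedding(chunk),  # the 768-dimensional embedding
        "metadata": {"text": chunk}           # original text stored for retrieval
    }
]

index.upsert(vectors=vectors)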
4. Query
Search vectors.
index.query(...)
5. Delete
Delete vectors.
index.delete(delete_all=True)
Metadata in RAG
Store useful context.
Example:
metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}
Useful later for:
- filtering
- citations
- debugging
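Metadata also enables filtered queries. A sketch using Pinecone's metadata filter syntax (the filter value is just an example):

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "notes.txt"}}  # only search vectors from notes.txt
)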
Best Practices
These are some best practices to follow when building your RAG system:
- Retrieval quality > model quality
- Use metadata
- Keep chunks meaningful
- Avoid tiny chunks
- Re-index after document updates
- Use overlap
- Start simple before frameworks
- Debug retrieval separately from generation
However, there are some considerations: real production RAG systems often add features not present in my simple personal RAG system, such as:
- authentication
- streaming
- caching
- citations
- reranking
- hybrid search
- observability
- evaluation pipelines
- vector versioning
- document syncing
Glossary
| Term | Meaning |
|---|---|
| RAG | Retrieval-Augmented Generation |
| Embedding | Numerical representation of text |
| Vector | Ordered list of numbers |
| Dimension | Number of values in vector |
| Chunk | Small document section |
| Metadata | Extra vector information |
| Top-K | Number of retrieved results |
| Similarity Search | Finding closest vectors |
| Cosine Similarity | Vector closeness metric |
| Index | Pinecone vector collection |
| Upsert | Insert/update vector |
| Retrieval | Finding relevant knowledge |
| Generation | Producing final answer |
| Hallucination | Fabricated answer |
| Reranking | Reordering retrieved chunks |
| Hybrid Search | Semantic + keyword retrieval |
Conclusion
Dear reader, I hope my point of view on RAG helped you, even a little bit, to understand how these systems work under the hood: from embedding, to retrieval, to generating the proper response.
And this is the essence of a RAG system.