Simran Shaikh

RAG & Semantic Search

In the rapidly evolving world of AI and large language models, Retrieval-Augmented Generation (RAG) has emerged as a game-changing technique. If you're building AI applications that need to understand and search through your own data, this guide will walk you through every essential concept you need to know.

Introduction: Why RAG Matters

Traditional language models have a fundamental limitation: they only know what they were trained on. RAG solves this by teaching AI systems to retrieve and use external knowledge before generating answers. Think of it as giving your AI a library card instead of just relying on what it memorized in school.

Let's dive into the 20 core concepts that make RAG and semantic search work.


1. Embeddings: Teaching Machines to Understand Meaning

What are embeddings?

An embedding is a numerical representation of data—whether text, images, or audio—that preserves the underlying meaning. Instead of treating words as arbitrary symbols, embeddings capture their semantic relationships.

For example, the sentence "Neural networks learn patterns" might become:

[0.12, -0.45, 0.88, 0.34, -0.67, ...]

Why do we need them?

Computers don't inherently understand language. Embeddings bridge this gap by:

  • Enabling meaningful comparisons between pieces of text
  • Clustering similar concepts together
  • Powering semantic search capabilities

Types of embeddings:

  • Text embeddings: For documents, queries, and general text
  • Image embeddings: For visual content like diagrams and photos
  • Multimodal embeddings: Combining text and images (e.g., CLIP)

Popular models:

  • OpenAI's text-embedding-3-large
  • Open-source options like BGE, E5, and MiniLM
  • CLIP for image embeddings
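
To make this concrete, here's a minimal sketch of generating text embeddings with the open-source sentence-transformers library and a MiniLM model (one of the options listed above). The model name and output dimensions are just one reasonable choice.

```python
from sentence_transformers import SentenceTransformer

# Load a small open-source embedding model (MiniLM family)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Neural networks learn patterns",
    "Deep learning models detect regularities in data",
    "I had pasta for lunch",
]

# Each sentence becomes a fixed-length vector (384 dimensions for this model)
embeddings = model.encode(sentences)
print(embeddings.shape)    # (3, 384)
print(embeddings[0][:5])   # first five dimensions of the first embedding
```

The first two sentences will end up with similar vectors; the third will sit far away from them in the embedding space.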

2. Semantic Search: Beyond Keywords

The fundamental shift

Traditional keyword search looks for exact word matches. Semantic search understands meaning.

Example in action:

Query: "How does backpropagation work?"

A document containing "Gradient descent updates weights during neural network training" would be found by semantic search even though it shares no exact keywords with the query. This is the power of understanding meaning over matching words.

How it works:

  1. Convert all documents into embeddings
  2. Convert the user's query into an embedding
  3. Compare the query vector with document vectors
  4. Return the most semantically similar results
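
Here's a minimal sketch of those four steps, reusing the same sentence-transformers model as above; the documents and query are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Gradient descent updates weights during neural network training",
    "The French Revolution began in 1789",
    "Transformers rely on self-attention to model long-range dependencies",
]

# Steps 1-2: embed the documents and the query
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode("How does backpropagation work?", convert_to_tensor=True)

# Steps 3-4: compare vectors and return the closest match
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```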

3. Vectors: The Language of AI

A vector is simply a list of numbers, like [0.32, -0.14, 0.88, ...]. Each dimension in this list captures a different aspect of meaning—think of them as coordinates in a multi-dimensional space of concepts.

When two vectors are close together in this space, their meanings are similar.


4. Vector Databases: Storage Built for Similarity

Why special databases?

Traditional databases excel at exact matches. Vector databases are optimized for a different question: "What's most similar to this?"

When you're dealing with millions of embeddings, you need specialized tools for fast similarity search.

Popular options:

| Database | Best For |
| --- | --- |
| FAISS | Local development and research |
| Chroma | Simple applications and prototyping |
| Pinecone | Cloud-scale production systems |
| Qdrant | Open-source deployments |

5. Similarity Metrics: Measuring Closeness

Cosine similarity is the most common metric for comparing embeddings:

similarity = (A · B) / (|A| × |B|)

The result ranges from -1 to 1:

  • 1: Vectors point in the same direction (very similar)
  • 0: Vectors are perpendicular (unrelated)
  • -1: Vectors point in opposite directions (opposite meanings)
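
The formula above is just a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (A · B) / (|A| × |B|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.32, -0.14, 0.88])
b = np.array([0.30, -0.10, 0.90])
print(cosine_similarity(a, b))  # close to 1.0 -- the vectors point the same way
```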

6. Chunking: Breaking Down Documents

The challenge

Large language models have input limits. A 100-page manual won't fit in a single context window. The solution? Break it into smaller, digestible pieces.

Chunking strategies:

| Method | Description | Use Case |
| --- | --- | --- |
| Fixed-size | Every 500 tokens | Simple, consistent chunks |
| Sliding window | Overlapping segments | Preserves context at boundaries |
| Semantic | Split by topic/paragraph | Maintains logical coherence |

Pro tip: Good chunking preserves complete thoughts. Splitting mid-sentence can harm retrieval quality.
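
Here's a minimal sketch of fixed-size chunking with a sliding-window overlap. Sizes are measured in words here for simplicity; real systems usually count tokens.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + size])
        if chunk:
            chunks.append(chunk)
    return chunks

long_document = "..."  # e.g. the full text of a 100-page manual
chunks = chunk_text(long_document)
print(len(chunks), "chunks")
```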


7. Indexing: Speed Through Structure

The problem

Without indexing, finding similar vectors means comparing your query against every single document vector. With millions of documents, this becomes impossibly slow.

The solution

Indexing creates data structures that enable fast approximate nearest neighbor search.

Common index types:

  • HNSW (Hierarchical Navigable Small World): Fast and accurate
  • IVF (Inverted File Index): Good for large-scale datasets
  • Flat: Exact search, slower but 100% accurate
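
As a sketch, here's how a flat (exact) index and an HNSW index might be built with FAISS; the dimensions, dataset, and HNSW parameter are illustrative.

```python
import faiss
import numpy as np

dim = 384
doc_vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings

flat_index = faiss.IndexFlatL2(dim)        # exact search: slower, 100% accurate
flat_index.add(doc_vectors)

hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # approximate search: much faster at scale
hnsw_index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw_index.search(query, 5)  # top-5 approximate nearest neighbours
print(ids)
```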

8. Reranking: Refinement for Precision

The two-stage approach

Vector search is fast but sometimes imprecise. Reranking adds a second, more careful analysis.

Process:

  1. Vector database returns top 20 candidates
  2. Reranker model scores these 20 more carefully
  3. Return top 5 best matches

Tools for reranking:
Cross-encoder models that jointly analyze the query and each candidate document provide superior accuracy compared to the independent embeddings used in initial retrieval.
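
A minimal sketch of the two-stage pattern with a sentence-transformers CrossEncoder; the model name and candidate documents are illustrative.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does backpropagation work?"
candidates = [
    "Gradient descent updates weights during neural network training",
    "Backpropagation computes gradients layer by layer using the chain rule",
    "The French Revolution began in 1789",
]  # imagine these are the top results from the vector database

# Score each (query, document) pair jointly, then keep the best matches
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked[:2]:
    print(f"{score:.3f}  {doc}")
```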


9. MMR: Avoiding Redundancy

Maximal Marginal Relevance solves a common problem: what if your top 5 results all say the same thing?

MMR balances two goals:

  1. Relevance: Results should match the query
  2. Diversity: Results shouldn't duplicate each other

This ensures users get comprehensive information, not repetitive answers.
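
Here's a minimal MMR sketch over precomputed embeddings; `lambda_mult` trades off relevance (toward 1.0) against diversity (toward 0.0).

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.7):
    """Greedily pick k documents that are relevant to the query but not redundant."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalise similarity to documents we've already chosen
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected  # indices of chosen documents, in selection order
```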


10. Metadata Filtering: Adding Structure to Search

Sometimes semantic similarity isn't enough. You might want results from a specific source, time period, or category.

Example metadata:

{
  "content": "The compressor operates at 150 PSI...",
  "source": "technical_manual.pdf",
  "page": 12,
  "topic": "compressor",
  "date": "2024-01-15"
}

Filtered query: "Find information about compressors, but only from the technical manual"

This combines semantic search with structured filtering for more precise results.
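
Here's a minimal sketch of that kind of filtered query using Chroma; the collection contents are illustrative.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    documents=["The compressor operates at 150 PSI..."],
    metadatas=[{"source": "technical_manual.pdf", "page": 12, "topic": "compressor"}],
    ids=["chunk-001"],
)

results = collection.query(
    query_texts=["compressor operating pressure"],
    n_results=3,
    where={"source": "technical_manual.pdf"},  # structured filter on metadata
)
print(results["documents"])
```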


11. Cross-Encoders vs. Bi-Encoders

Two architectures for comparison:

| Type | How It Works | Speed | Accuracy |
| --- | --- | --- | --- |
| Bi-encoder | Encodes query and document separately | Fast | Good |
| Cross-encoder | Processes query and document together | Slow | Excellent |

Usage pattern:

  • Use bi-encoders (standard embeddings) for initial retrieval
  • Use cross-encoders for reranking the top results

12. Hybrid Search: Best of Both Worlds

Pure semantic search has a weakness: it might miss exact technical terms or specific phrases.

Hybrid search combines:

  • Keyword search (BM25): Catches exact terms and rare phrases
  • Vector search: Understands meaning and context

Example: A query for "Python asyncio" benefits from:

  • Keyword search finding exact mentions of "asyncio"
  • Semantic search finding related concepts like "asynchronous programming"
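
A minimal sketch of hybrid scoring, fusing BM25 keyword scores with vector similarity via a simple weighted sum; the rank_bm25 package and the 0.5/0.5 weights are illustrative choices.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Python asyncio provides an event loop for coroutines",
    "Asynchronous programming lets tasks overlap without threads",
    "NumPy arrays support vectorised arithmetic",
]
query = "Python asyncio"

# Keyword side: BM25 over whitespace-tokenised text
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
keyword_scores = np.array(bm25.get_scores(query.lower().split()))

# Semantic side: cosine similarity between embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
vector_scores = util.cos_sim(model.encode(query), model.encode(corpus))[0].numpy()

def normalise(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * normalise(keyword_scores) + 0.5 * normalise(vector_scores)
print(corpus[int(hybrid.argmax())])
```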

13. Knowledge Graphs: Structured Relationships

While vectors capture similarity, knowledge graphs capture relationships.

Structure:

  • Nodes represent entities (concepts, people, things)
  • Edges represent relationships between them

Example:

Transformer --uses--> Self-Attention
Self-Attention --enables--> Parallel Processing

Applications:

  • Graph RAG for multi-hop reasoning
  • Scientific knowledge representation
  • Complex question answering
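
To make the structure concrete, here's the example above as a tiny directed graph built with networkx; a Graph RAG system would traverse edges like these to answer multi-hop questions.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Transformer", "Self-Attention", relation="uses")
kg.add_edge("Self-Attention", "Parallel Processing", relation="enables")

# Multi-hop question: what does the Transformer ultimately enable?
for _, target, data in kg.out_edges("Self-Attention", data=True):
    print("Transformer -> Self-Attention ->", target, f"({data['relation']})")
```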

14. Prompts and Context: Controlling Generation

Context consists of the chunks retrieved from your knowledge base.

Prompts are instructions that tell the LLM how to use that context.

Example prompt:

Answer the following question using ONLY the context provided below. 
If the answer cannot be found in the context, say "I don't know."

Context: [retrieved chunks]

Question: [user query]

Well-crafted prompts are essential for preventing hallucinations and ensuring grounded responses.
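
Here's a minimal sketch of assembling that prompt from retrieved chunks:

```python
def build_prompt(chunks: list[str], question: str) -> str:
    """Combine retrieved chunks and the user's question into a grounded prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the following question using ONLY the context provided below.\n"
        'If the answer cannot be found in the context, say "I don\'t know."\n\n'
        f"Context: {context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    chunks=["The compressor operates at 150 PSI..."],
    question="What pressure does the compressor run at?",
)
```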


15. Hallucination: The Challenge RAG Solves

The problem: Language models can generate plausible-sounding but entirely fabricated information.

RAG's solution:

  • Ground responses in retrieved documents
  • Include citations to sources
  • Use prompts that enforce context-only answers

RAG doesn't eliminate hallucinations entirely, but it dramatically reduces them by anchoring the model to factual sources.


16. Tokens: The Currency of Language Models

A token is roughly equivalent to a word fragment. The sentence "Artificial Intelligence is transforming technology" might be tokenized as:

["Art", "ificial", " Intelligence", " is", " transform", "ing", " technology"]

Why it matters:

  • LLMs have context window limits (e.g., 128K tokens for GPT-4 Turbo and GPT-4o)
  • Token count affects both cost and performance
  • Understanding tokenization helps optimize chunk sizes
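
A minimal sketch of counting tokens with OpenAI's tiktoken library; the exact split depends on the tokenizer, so the output is illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models
text = "Artificial Intelligence is transforming technology"

tokens = enc.encode(text)
print(len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])    # the individual token strings
```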

17. Temperature: Controlling Creativity

Temperature is a parameter that controls the randomness of model outputs:

| Temperature | Behavior | Use Case |
| --- | --- | --- |
| 0.0 | Deterministic, factual | RAG systems, factual Q&A |
| 0.7 | Balanced | General conversation |
| 1.0+ | Creative, varied | Creative writing |

For RAG applications, lower temperatures (0-0.3) typically work best.
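
For example, here's a minimal sketch of pinning temperature to 0 with the OpenAI Python SDK (v1+); the model name and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Answer using ONLY the context below...\n\nContext: ...\n\nQuestion: ..."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic, grounded answers for RAG
)
print(response.choices[0].message.content)
```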


18. Top-k: How Many Results to Retrieve

The top_k parameter determines how many documents to retrieve from your vector database.

Considerations:

  • Too few (k=1-2): Risk missing relevant information
  • Too many (k=50+): Include noise, increase costs
  • Sweet spot: Often k=3-10 depending on your use case

Experiment to find the right balance for your application.


19. Evaluation Metrics: Measuring Success

How do you know if your RAG system is working well?

Key metrics:

| Metric | What It Measures |
| --- | --- |
| Recall@k | Are the right documents in the top-k results? |
| MRR (Mean Reciprocal Rank) | How quickly do we find the first relevant result? |
| NDCG | Overall quality of the ranking |
| Answer relevance | Does the final answer address the question? |
| Faithfulness | Is the answer grounded in the retrieved context? |

Regular evaluation ensures your system maintains quality as your knowledge base grows.
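
As a sketch, Recall@k and MRR can be computed in a few lines, given retrieved document ids and known-relevant ids (the example ids are illustrative).

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant result, over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(all_retrieved)

print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=3))  # 0.5
print(mrr([["d3", "d1", "d7"]], [{"d1", "d9"}]))           # 0.5
```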


20. The RAG Pipeline: Putting It All Together

A complete RAG system follows this flow:

1. Ingestion Phase:

  • Load documents
  • Split into chunks
  • Generate embeddings
  • Store in vector database with metadata

2. Retrieval Phase:

  • User submits a query
  • Convert query to embedding
  • Search vector database
  • Apply metadata filters
  • Rerank results
  • Apply MMR for diversity

3. Generation Phase:

  • Construct prompt with retrieved context
  • Call LLM with controlled temperature
  • Generate response with citations
  • Return to user

Each step is crucial for building a system that's both accurate and reliable.
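
Here's a minimal end-to-end sketch of the three phases, using Chroma for storage and the OpenAI SDK for generation; the model name, single document, and one-chunk ingestion are illustrative stand-ins for a real pipeline.

```python
import chromadb
from openai import OpenAI

# --- 1. Ingestion ---
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")
collection.add(
    documents=["The compressor operates at 150 PSI and must be inspected monthly."],
    metadatas=[{"source": "technical_manual.pdf"}],
    ids=["chunk-001"],
)  # Chroma embeds the documents with its default embedding function

# --- 2. Retrieval ---
question = "How often should the compressor be inspected?"
results = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(results["documents"][0])

# --- 3. Generation ---
prompt = (
    'Answer using ONLY the context below. If the answer is not there, say "I don\'t know."\n\n'
    f"Context: {context}\n\nQuestion: {question}"
)
client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(answer.choices[0].message.content)
```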


Conclusion: The Power of RAG

At its core, RAG and semantic search represent a fundamental shift in how we build AI applications. Instead of hoping a pre-trained model knows everything, we give it the ability to learn from our specific data in real-time.

The one-sentence summary:

RAG + Semantic Search = Teaching AI to read your data before answering

Whether you're building a customer support bot, a research assistant, or an internal knowledge management system, understanding these 20 concepts gives you the foundation to create intelligent, grounded, and reliable AI applications.


Next Steps

Ready to go deeper? Consider:

  1. Building a simple RAG system using LangChain or LlamaIndex
  2. Experimenting with different embedding models to see what works for your domain
  3. Implementing evaluation metrics to measure and improve your system
  4. Exploring advanced techniques like Graph RAG or multi-query retrieval

The field is evolving rapidly, but these fundamentals will serve you well no matter which direction it takes.


Have questions or want to share your RAG implementation experiences? Let's discuss in the comments below!
