Simran Shaikh

RAG & Semantic Search

In the rapidly evolving world of AI and large language models, Retrieval-Augmented Generation (RAG) has emerged as a game-changing technique. If you're building AI applications that need to understand and search through your own data, this guide will walk you through every essential concept you need to know.

Introduction: Why RAG Matters

Traditional language models have a fundamental limitation: they only know what they were trained on. RAG solves this by teaching AI systems to retrieve and use external knowledge before generating answers. Think of it as giving your AI a library card instead of just relying on what it memorized in school.

Let's dive into the 20 core concepts that make RAG and semantic search work.


1. Embeddings: Teaching Machines to Understand Meaning

What are embeddings?

An embedding is a numerical representation of data—whether text, images, or audio—that preserves the underlying meaning. Instead of treating words as arbitrary symbols, embeddings capture their semantic relationships.

For example, the sentence "Neural networks learn patterns" might become:

[0.12, -0.45, 0.88, 0.34, -0.67, ...]

Why do we need them?

Computers don't inherently understand language. Embeddings bridge this gap by:

  • Enabling meaningful comparisons between pieces of text
  • Clustering similar concepts together
  • Powering semantic search capabilities

Types of embeddings:

  • Text embeddings: For documents, queries, and general text
  • Image embeddings: For visual content like diagrams and photos
  • Multimodal embeddings: Combining text and images (e.g., CLIP)

Popular models:

  • OpenAI's text-embedding-3-large
  • Open-source options like BGE, E5, and MiniLM
  • CLIP for image embeddings
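
To make this concrete, here's a minimal sketch of generating text embeddings with the open-source sentence-transformers library and a MiniLM model (one of the options listed above). The model name and output dimensions are just one reasonable choice.

```python
from sentence_transformers import SentenceTransformer

# Load a small open-source embedding model (MiniLM family)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Neural networks learn patterns",
    "Deep learning models detect regularities in data",
    "I had pasta for lunch",
]

# Each sentence becomes a fixed-length vector (384 dimensions for this model)
embeddings = model.encode(sentences)
print(embeddings.shape)    # (3, 384)
print(embeddings[0][:5])   # first five dimensions of the first embedding
```

The first two sentences will end up with similar vectors; the third will sit far away from them in the embedding space.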

2. Semantic Search: Beyond Keywords

The fundamental shift

Traditional keyword search looks for exact word matches. Semantic search understands meaning.

Example in action:

Query: "How does backpropagation work?"

A document containing "Gradient descent updates weights during neural network training" would be found by semantic search even though it shares no exact keywords with the query. This is the power of understanding meaning over matching words.

How it works:

  1. Convert all documents into embeddings
  2. Convert the user's query into an embedding
  3. Compare the query vector with document vectors
  4. Return the most semantically similar results
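
Here's a minimal sketch of those four steps, reusing the same sentence-transformers model as above; the documents and query are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Gradient descent updates weights during neural network training",
    "The French Revolution began in 1789",
    "Transformers rely on self-attention to model long-range dependencies",
]

# Steps 1-2: embed the documents and the query
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode("How does backpropagation work?", convert_to_tensor=True)

# Steps 3-4: compare vectors and return the closest match
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```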

3. Vectors: The Language of AI

A vector is simply a list of numbers, like [0.32, -0.14, 0.88, ...]. Each dimension in this list captures a different aspect of meaning—think of them as coordinates in a multi-dimensional space of concepts.

When two vectors are close together in this space, their meanings are similar.


4. Vector Databases: Storage Built for Similarity

Why special databases?

Traditional databases excel at exact matches. Vector databases are optimized for a different question: "What's most similar to this?"

When you're dealing with millions of embeddings, you need specialized tools for fast similarity search.

Popular options:

| Database | Best For |
| --- | --- |
| FAISS | Local development and research |
| Chroma | Simple applications and prototyping |
| Pinecone | Cloud-scale production systems |
| Qdrant | Open-source deployments |

5. Similarity Metrics: Measuring Closeness

Cosine similarity is the most common metric for comparing embeddings:

similarity = (A · B) / (|A| × |B|)

The result ranges from -1 to 1:

  • 1: Vectors point in the same direction (very similar)
  • 0: Vectors are perpendicular (unrelated)
  • -1: Vectors point in opposite directions (opposite meanings)
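
The formula above is just a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (A · B) / (|A| × |B|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.32, -0.14, 0.88])
b = np.array([0.30, -0.10, 0.90])
print(cosine_similarity(a, b))  # close to 1.0 -- the vectors point the same way
```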

6. Chunking: Breaking Down Documents

The challenge

Large language models have input limits. A 100-page manual won't fit in a single context window. The solution? Break it into smaller, digestible pieces.

Chunking strategies:

| Method | Description | Use Case |
| --- | --- | --- |
| Fixed-size | Every 500 tokens | Simple, consistent chunks |
| Sliding window | Overlapping segments | Preserves context at boundaries |
| Semantic | Split by topic/paragraph | Maintains logical coherence |

Pro tip: Good chunking preserves complete thoughts. Splitting mid-sentence can harm retrieval quality.
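
Here's a minimal sketch of fixed-size chunking with a sliding-window overlap. Sizes are measured in words here for simplicity; real systems usually count tokens.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + size])
        if chunk:
            chunks.append(chunk)
    return chunks

long_document = "..."  # e.g. the full text of a 100-page manual
chunks = chunk_text(long_document)
print(len(chunks), "chunks")
```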


7. Indexing: Speed Through Structure

The problem

Without indexing, finding similar vectors means comparing your query against every single document vector. With millions of documents, this becomes impossibly slow.

The solution

Indexing creates data structures that enable fast approximate nearest neighbor search.

Common index types:

  • HNSW (Hierarchical Navigable Small World): Fast and accurate
  • IVF (Inverted File Index): Good for large-scale datasets
  • Flat: Exact search, slower but 100% accurate
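
As a sketch, here's how a flat (exact) index and an HNSW index might be built with FAISS; the dimensions, dataset, and HNSW parameter are illustrative.

```python
import faiss
import numpy as np

dim = 384
doc_vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings

flat_index = faiss.IndexFlatL2(dim)        # exact search: slower, 100% accurate
flat_index.add(doc_vectors)

hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # approximate search: much faster at scale
hnsw_index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw_index.search(query, 5)  # top-5 approximate nearest neighbours
print(ids)
```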

8. Reranking: Refinement for Precision

The two-stage approach

Vector search is fast but sometimes imprecise. Reranking adds a second, more careful analysis.

Process:

  1. Vector database returns top 20 candidates
  2. Reranker model scores these 20 more carefully
  3. Return top 5 best matches

Tools for reranking:
Cross-encoder models that jointly analyze the query and each candidate document provide superior accuracy compared to the independent embeddings used in initial retrieval.
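
A minimal sketch of the two-stage pattern with a sentence-transformers CrossEncoder; the model name and candidate documents are illustrative.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does backpropagation work?"
candidates = [
    "Gradient descent updates weights during neural network training",
    "Backpropagation computes gradients layer by layer using the chain rule",
    "The French Revolution began in 1789",
]  # imagine these are the top results from the vector database

# Score each (query, document) pair jointly, then keep the best matches
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked[:2]:
    print(f"{score:.3f}  {doc}")
```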


9. MMR: Avoiding Redundancy

Maximal Marginal Relevance solves a common problem: what if your top 5 results all say the same thing?

MMR balances two goals:

  1. Relevance: Results should match the query
  2. Diversity: Results shouldn't duplicate each other

This ensures users get comprehensive information, not repetitive answers.
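
Here's a minimal MMR sketch over precomputed embeddings; `lambda_mult` trades off relevance (toward 1.0) against diversity (toward 0.0).

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.7):
    """Greedily pick k documents that are relevant to the query but not redundant."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalise similarity to documents we've already chosen
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected  # indices of chosen documents, in selection order
```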


10. Metadata Filtering: Adding Structure to Search

Sometimes semantic similarity isn't enough. You might want results from a specific source, time period, or category.

Example metadata:

{
  "content": "The compressor operates at 150 PSI...",
  "source": "technical_manual.pdf",
  "page": 12,
  "topic": "compressor",
  "date": "2024-01-15"
}

Filtered query: "Find information about compressors, but only from the technical manual"

This combines semantic search with structured filtering for more precise results.
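
Here's a minimal sketch of that kind of filtered query using Chroma; the collection contents are illustrative.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    documents=["The compressor operates at 150 PSI..."],
    metadatas=[{"source": "technical_manual.pdf", "page": 12, "topic": "compressor"}],
    ids=["chunk-001"],
)

results = collection.query(
    query_texts=["compressor operating pressure"],
    n_results=3,
    where={"source": "technical_manual.pdf"},  # structured filter on metadata
)
print(results["documents"])
```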


11. Cross-Encoders vs. Bi-Encoders

Two architectures for comparison:

| Type | How It Works | Speed | Accuracy |
| --- | --- | --- | --- |
| Bi-encoder | Encodes query and document separately | Fast | Good |
| Cross-encoder | Processes query and document together | Slow | Excellent |

Usage pattern:

  • Use bi-encoders (standard embeddings) for initial retrieval
  • Use cross-encoders for reranking the top results

12. Hybrid Search: Best of Both Worlds

Pure semantic search has a weakness: it might miss exact technical terms or specific phrases.

Hybrid search combines:

  • Keyword search (BM25): Catches exact terms and rare phrases
  • Vector search: Understands meaning and context

Example: A query for "Python asyncio" benefits from:

  • Keyword search finding exact mentions of "asyncio"
  • Semantic search finding related concepts like "asynchronous programming"
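
A minimal sketch of hybrid scoring, fusing BM25 keyword scores with vector similarity via a simple weighted sum; the rank_bm25 package and the 0.5/0.5 weights are illustrative choices.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Python asyncio provides an event loop for coroutines",
    "Asynchronous programming lets tasks overlap without threads",
    "NumPy arrays support vectorised arithmetic",
]
query = "Python asyncio"

# Keyword side: BM25 over whitespace-tokenised text
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
keyword_scores = np.array(bm25.get_scores(query.lower().split()))

# Semantic side: cosine similarity between embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
vector_scores = util.cos_sim(model.encode(query), model.encode(corpus))[0].numpy()

def normalise(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * normalise(keyword_scores) + 0.5 * normalise(vector_scores)
print(corpus[int(hybrid.argmax())])
```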

13. Knowledge Graphs: Structured Relationships

While vectors capture similarity, knowledge graphs capture relationships.

Structure:

  • Nodes represent entities (concepts, people, things)
  • Edges represent relationships between them

Example:

Transformer --uses--> Self-Attention
Self-Attention --enables--> Parallel Processing

Applications:

  • Graph RAG for multi-hop reasoning
  • Scientific knowledge representation
  • Complex question answering
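
To make the structure concrete, here's the example above as a tiny directed graph built with networkx; a Graph RAG system would traverse edges like these to answer multi-hop questions.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Transformer", "Self-Attention", relation="uses")
kg.add_edge("Self-Attention", "Parallel Processing", relation="enables")

# Multi-hop question: what does the Transformer ultimately enable?
for _, target, data in kg.out_edges("Self-Attention", data=True):
    print("Transformer -> Self-Attention ->", target, f"({data['relation']})")
```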

14. Prompts and Context: Controlling Generation

Context consists of the chunks retrieved from your knowledge base.

Prompts are instructions that tell the LLM how to use that context.

Example prompt:

Answer the following question using ONLY the context provided below. 
If the answer cannot be found in the context, say "I don't know."

Context: [retrieved chunks]

Question: [user query]

Well-crafted prompts are essential for preventing hallucinations and ensuring grounded responses.
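
Here's a minimal sketch of assembling that prompt from retrieved chunks:

```python
def build_prompt(chunks: list[str], question: str) -> str:
    """Combine retrieved chunks and the user's question into a grounded prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the following question using ONLY the context provided below.\n"
        'If the answer cannot be found in the context, say "I don\'t know."\n\n'
        f"Context: {context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    chunks=["The compressor operates at 150 PSI..."],
    question="What pressure does the compressor run at?",
)
```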


15. Hallucination: The Challenge RAG Solves

The problem: Language models can generate plausible-sounding but entirely fabricated information.

RAG's solution:

  • Ground responses in retrieved documents
  • Include citations to sources
  • Use prompts that enforce context-only answers

RAG doesn't eliminate hallucinations entirely, but it dramatically reduces them by anchoring the model to factual sources.


16. Tokens: The Currency of Language Models

A token is roughly equivalent to a word fragment. The sentence "Artificial Intelligence is transforming technology" might be tokenized as:

["Art", "ificial", " Intelligence", " is", " transform", "ing", " technology"]

Why it matters:

  • LLMs have context window limits (e.g., 128K tokens for GPT-4 Turbo and GPT-4o)
  • Token count affects both cost and performance
  • Understanding tokenization helps optimize chunk sizes
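
A minimal sketch of counting tokens with OpenAI's tiktoken library; the exact split depends on the tokenizer, so the output is illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models
text = "Artificial Intelligence is transforming technology"

tokens = enc.encode(text)
print(len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])    # the individual token strings
```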

17. Temperature: Controlling Creativity

Temperature is a parameter that controls the randomness of model outputs:

| Temperature | Behavior | Use Case |
| --- | --- | --- |
| 0.0 | Deterministic, factual | RAG systems, factual Q&A |
| 0.7 | Balanced | General conversation |
| 1.0+ | Creative, varied | Creative writing |

For RAG applications, lower temperatures (0-0.3) typically work best.
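
For example, here's a minimal sketch of pinning temperature to 0 with the OpenAI Python SDK (v1+); the model name and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Answer using ONLY the context below...\n\nContext: ...\n\nQuestion: ..."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic, grounded answers for RAG
)
print(response.choices[0].message.content)
```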


18. Top-k: How Many Results to Retrieve

The top_k parameter determines how many documents to retrieve from your vector database.

Considerations:

  • Too few (k=1-2): Risk missing relevant information
  • Too many (k=50+): Include noise, increase costs
  • Sweet spot: Often k=3-10 depending on your use case

Experiment to find the right balance for your application.


19. Evaluation Metrics: Measuring Success

How do you know if your RAG system is working well?

Key metrics:

| Metric | What It Measures |
| --- | --- |
| Recall@k | Are the right documents in the top-k results? |
| MRR (Mean Reciprocal Rank) | How quickly do we find the first relevant result? |
| NDCG | Overall quality of the ranking |
| Answer relevance | Does the final answer address the question? |
| Faithfulness | Is the answer grounded in the retrieved context? |

Regular evaluation ensures your system maintains quality as your knowledge base grows.
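
As a sketch, Recall@k and MRR can be computed in a few lines, given retrieved document ids and known-relevant ids (the example ids are illustrative).

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant result, over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(all_retrieved)

print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=3))  # 0.5
print(mrr([["d3", "d1", "d7"]], [{"d1", "d9"}]))           # 0.5
```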


20. The RAG Pipeline: Putting It All Together

A complete RAG system follows this flow:

1. Ingestion Phase:

  • Load documents
  • Split into chunks
  • Generate embeddings
  • Store in vector database with metadata

2. Retrieval Phase:

  • User submits a query
  • Convert query to embedding
  • Search vector database
  • Apply metadata filters
  • Rerank results
  • Apply MMR for diversity

3. Generation Phase:

  • Construct prompt with retrieved context
  • Call LLM with controlled temperature
  • Generate response with citations
  • Return to user

Each step is crucial for building a system that's both accurate and reliable.
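
Here's a minimal end-to-end sketch of the three phases, using Chroma for storage and the OpenAI SDK for generation; the model name, single document, and one-chunk ingestion are illustrative stand-ins for a real pipeline.

```python
import chromadb
from openai import OpenAI

# --- 1. Ingestion ---
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")
collection.add(
    documents=["The compressor operates at 150 PSI and must be inspected monthly."],
    metadatas=[{"source": "technical_manual.pdf"}],
    ids=["chunk-001"],
)  # Chroma embeds the documents with its default embedding function

# --- 2. Retrieval ---
question = "How often should the compressor be inspected?"
results = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(results["documents"][0])

# --- 3. Generation ---
prompt = (
    'Answer using ONLY the context below. If the answer is not there, say "I don\'t know."\n\n'
    f"Context: {context}\n\nQuestion: {question}"
)
client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(answer.choices[0].message.content)
```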


Conclusion: The Power of RAG

At its core, RAG and semantic search represent a fundamental shift in how we build AI applications. Instead of hoping a pre-trained model knows everything, we give it the ability to learn from our specific data in real-time.

The one-sentence summary:

RAG + Semantic Search = Teaching AI to read your data before answering

Whether you're building a customer support bot, a research assistant, or an internal knowledge management system, understanding these 20 concepts gives you the foundation to create intelligent, grounded, and reliable AI applications.


Next Steps

Ready to go deeper? Consider:

  1. Building a simple RAG system using LangChain or LlamaIndex
  2. Experimenting with different embedding models to see what works for your domain
  3. Implementing evaluation metrics to measure and improve your system
  4. Exploring advanced techniques like Graph RAG or multi-query retrieval

The field is evolving rapidly, but these fundamentals will serve you well no matter which direction it takes.


Have questions or want to share your RAG implementation experiences? Let's discuss in the comments below!
