

Why RAG is the Future of Search (And How Elasticsearch Makes It Possible)

The Problem RAG Solves

Large Language Models (LLMs) like GPT-4 are incredibly powerful, but they have a critical limitation: they can only answer based on what they were trained on. This leads to two major issues:

  1. Hallucinations: The model might confidently provide incorrect information
  2. Limited Knowledge: The model doesn't know about your specific data, documents, or recent information

RAG solves this by giving the LLM a "brain" made of your specific data.


The Three Steps of RAG

Step 1: RETRIEVAL (The "Librarian")

What happens:

  • When you ask a question, we DON'T send it to the AI immediately
  • Instead, we first ask Elasticsearch to find the exact "pages" from our data that match the intent of the question

How it works:

  1. Your question is converted into a vector (a list of numbers representing meaning) using OpenAI's embedding model
  2. Elasticsearch performs a KNN (K-Nearest Neighbors) search to find documents with similar meaning
  3. Elasticsearch returns the top 5 most relevant document chunks

Key Teaching Point:

"Elastic isn't looking for words; it's looking for meaning using Vector Search. This is semantic search, not keyword search."

// Convert question to vector
const questionEmbedding = await generateEmbedding(question);

// Search Elasticsearch using KNN
const searchResponse = await elasticClient.search({
  index: INDEX_NAME,
  knn: {
    field: "embedding",
    query_vector: questionEmbedding,
    k: 5, // Return top 5 most similar documents
    num_candidates: 100,
  },
});
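The generateEmbedding helper isn't shown in the snippet above, so here is a minimal sketch of what it could look like using the official openai Node SDK. The text-embedding-3-small model name is an assumption; any OpenAI embedding model works as long as its output dimensions match your embedding field.

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Convert a piece of text into a vector of numbers that captures its meaning
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model; match its dims to your index mapping
    input: text,
  });
  return response.data[0].embedding;
}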

Step 2: AUGMENTATION (The "Context Window")

What happens:

  • Your application combines Elasticsearch's search results with your original question
  • This creates a "context window" that tells the LLM exactly what information to use

How it works:

  1. Take the retrieved document chunks
  2. Format them with their source information
  3. Combine them into a single context string
  4. Add your question to this context

Key Teaching Point:

"We are 'augmenting' the AI. We're telling it: 'Here is the data you need. Only use this info to answer the question.' This prevents hallucinations."

// Format context from retrieved documents
const context = relevantDocs
  .map((doc: any) => `[Source: ${doc.title}]\n${doc.content}`)
  .join("\n\n---\n\n");
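The relevantDocs array comes from the Step 1 search response. One way it might be pulled out, assuming each indexed document's _source stores title and content fields alongside the embedding:

// Extract the matching documents from the Elasticsearch response
const relevantDocs = searchResponse.hits.hits.map((hit: any) => hit._source);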

Step 3: GENERATION (The "Speaker")

What happens:

  • OpenAI's GPT-4 takes the context and your question
  • It generates a human-sounding, accurate response based ONLY on the provided context

How it works:

  1. Send a system prompt that instructs the LLM to only use the provided context
  2. Include the retrieved context + your question in the user message
  3. GPT-4 generates a response that summarizes and synthesizes the context

Key Teaching Point:

"The AI isn't guessing anymore. It's summarizing the high-quality data that Elastic provided. This is why RAG is so powerful—it combines the reasoning ability of LLMs with the accuracy of your own data."

const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    {
      role: "system",
      content: `You are a helpful assistant that answers questions based on the provided context... Only use information from the provided context.`,
    },
    {
      role: "user",
      content: `Context from knowledge base:\n\n${context}\n\nQuestion: ${question}`,
    },
  ],
});
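The final answer is then read from the completion object in the usual OpenAI SDK way, for example:

// The model's answer, grounded in the retrieved context
const answer = completion.choices[0].message.content;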

How Elastic Acts as "Long-Term Memory"

Traditional LLM (Without RAG):

User Question → GPT-4 → Answer (based on training data only)

Problem: Answers are limited to what GPT-4 was trained on; it can't access your documents

RAG with Elastic (With RAG):

User Question
  → Generate Embedding
  → Elasticsearch Vector Search (finds your documents)
  → Combine Context + Question
  → GPT-4 → Answer (grounded in your data)

Solution: Elastic acts as the "long-term memory" that stores and retrieves your specific knowledge
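Putting the three steps together, the whole pipeline can be sketched as a single function that reuses the elasticClient, openai, INDEX_NAME, and generateEmbedding pieces from the snippets above. This is a rough sketch, not production code:

async function askQuestion(question: string): Promise<string> {
  // Step 1: RETRIEVAL – find the most relevant chunks in Elasticsearch
  const questionEmbedding = await generateEmbedding(question);
  const searchResponse = await elasticClient.search({
    index: INDEX_NAME,
    knn: { field: "embedding", query_vector: questionEmbedding, k: 5, num_candidates: 100 },
  });
  const relevantDocs = searchResponse.hits.hits.map((hit: any) => hit._source);

  // Step 2: AUGMENTATION – build a context window from the retrieved chunks
  const context = relevantDocs
    .map((doc: any) => `[Source: ${doc.title}]\n${doc.content}`)
    .join("\n\n---\n\n");

  // Step 3: GENERATION – let GPT-4 answer using only that context
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context from knowledge base:\n\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content ?? "";
}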

Why Elastic is Perfect for This:

  1. Vector Storage: Elasticsearch's dense_vector field type stores embeddings efficiently (see the mapping sketch after this list)
  2. KNN Search: Fast similarity search using cosine similarity
  3. Scalability: Can handle millions of documents
  4. Real-time: Documents are searchable immediately after indexing
  5. Hybrid Search: Can combine vector search with traditional keyword search
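As a concrete example of points 1 and 2, an index mapping for this kind of setup might look like the sketch below. The 1536 dimension is an assumption matching OpenAI's text-embedding-3-small (and ada-002) output; adjust it to whatever embedding model you use.

// Hypothetical mapping: a dense_vector field for embeddings plus plain text fields
await elasticClient.indices.create({
  index: INDEX_NAME,
  mappings: {
    properties: {
      title: { type: "text" },
      content: { type: "text" },
      embedding: {
        type: "dense_vector",
        dims: 1536,           // must match your embedding model's output size (assumed here)
        index: true,          // make the field searchable with KNN
        similarity: "cosine", // measure "closeness" with cosine similarity
      },
    },
  },
});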

The Complete Flow

┌─────────────────┐
│  User Question  │
│  "What is RAG?" │
└────────┬────────┘
         │
         ▼
┌───────────────────────────┐
│  Step 1: RETRIEVAL        │
│  ───────────────────────  │
│  1. Generate embedding    │
│  2. Search Elasticsearch  │
│  3. Get top 5 docs        │
└────────┬──────────────────┘
         │
         ▼
┌───────────────────────────┐
│  Step 2: AUGMENTATION     │
│  ───────────────────────  │
│  Combine:                 │
│  - Retrieved docs         │
│  - Original question      │
└────────┬──────────────────┘
         │
         ▼
┌───────────────────────────┐
│  Step 3: GENERATION       │
│  ───────────────────────  │
│  GPT-4 generates answer   │
│  based on context         │
└────────┬──────────────────┘
         │
         ▼
┌─────────────────┐
│  Final Answer   │
│  (with sources) │
└─────────────────┘

🎓 Key Concepts Explained

What are Embeddings?

Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings (vectors that are close together in high-dimensional space).

Example:

  • "dog" and "puppy" → Similar vectors (close in space)
  • "dog" and "airplane" → Different vectors (far apart in space)

What is KNN (K-Nearest Neighbors)?

KNN finds the K most similar vectors to your query vector. In our case:

  • K = 5 means we get the 5 most semantically similar documents
  • Uses cosine similarity to measure "closeness"
  • Returns documents that mean similar things, not just contain similar words
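Cosine similarity itself is easy to picture in code. This isn't what Elasticsearch runs internally, just a small illustration of the idea:

// Cosine similarity: close to 1 means "same meaning", close to 0 means unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. cosineSimilarity(await generateEmbedding("dog"), await generateEmbedding("puppy"))
// would score higher than the same comparison between "dog" and "airplane"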

Why Chunk Documents?

Large documents are split into smaller chunks because:

  1. Context Limits: LLMs have token limits
  2. Precision: Smaller chunks allow more precise retrieval
  3. Relevance: You get exactly the relevant section, not the entire document
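A very simple chunking strategy is fixed-size character windows with a little overlap, so sentences that straddle a boundary aren't lost. The sizes below are arbitrary assumptions; in practice, token-aware splitters are common:

// Naive fixed-size chunking with overlap (sizes are illustrative, not prescriptive)
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}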

Why This Demonstrates Elastic's Mission

Elastic is the Search AI Company. This RAG application showcases:

  1. Vector Search: Moving beyond keyword matching to semantic understanding
  2. Production-Ready: Elasticsearch Serverless handles scale and reliability
  3. Developer Experience: Simple API, powerful capabilities
  4. Real-World Use Cases: RAG is one of the most important AI applications today

Further Learning
