Ansuj Kumar Meher
Vectorized Conversations: Building a Quick RAG Chat Assistant Using Elasticsearch as a Vector Database

Author Introduction

Hi, I’m Ansuj Kumar Meher, a developer deeply interested in search systems, distributed architecture, and AI-driven applications. Over the past few months, I’ve been exploring how vector search can transform traditional retrieval systems. For the Elastic Blogathon 2026, I wanted to move beyond theory and build something practical — a fully working Hybrid RAG assistant powered by Elasticsearch on Elastic Cloud.

This blog documents that journey — including architecture decisions, implementation details, and real observations from building the system.


Abstract

Large language models are impressive, but without grounding, they hallucinate. In this blog, I demonstrate how to build a Hybrid Retrieval-Augmented Generation (RAG) system using Elasticsearch as a vector database on Elastic Cloud. By combining BM25 keyword search, HNSW-based vector similarity, and Gemini 2.5 Flash for generation, we create a scalable and production-ready semantic search assistant.


1. The Problem: LLMs Without Grounding

When I first built a chatbot using an LLM, it felt magical. It could answer questions fluently and summarize information beautifully.

But then I asked it something outside its knowledge scope.

It still answered — confidently — and incorrectly.

That’s when I realized:

LLMs are great language generators.

They are not reliable retrieval systems.

If we want trustworthy AI systems, they must retrieve real information before generating answers.

This is where Retrieval-Augmented Generation (RAG) becomes essential.


2. Why Hybrid Search Instead of Just Vector Search?

While building this project, I tested three approaches:

  1. Keyword-only search (BM25)
  2. Vector-only search
  3. Hybrid search (BM25 + vector similarity)

Each had strengths and weaknesses.

Keyword Search (BM25)

  • Excellent at lexical precision
  • Struggles with semantic intent

Vector Search

  • Understands meaning
  • Sometimes ignores important keywords

Hybrid Search

  • Balances precision and semantic understanding
  • Produces more stable ranking

For RAG systems, retrieval quality determines generation quality. Hybrid search consistently produced better grounding.

That’s why I built the system using Elasticsearch’s hybrid search capabilities.


3. System Architecture

[Image: Hybrid RAG architecture showing user query, embedding generation, Elasticsearch hybrid search, and LLM output]

The flow is simple but powerful:

  1. User submits a query.
  2. Query is converted into a 384-dimensional embedding using MiniLM.
  3. Elasticsearch performs hybrid retrieval:
    • BM25 keyword match
    • kNN vector search using HNSW + cosine similarity
  4. Top documents are retrieved.
  5. Retrieved context is injected into the LLM prompt.
  6. Gemini generates a grounded response.

The key idea is that the LLM never answers without context.
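The six steps above can be sketched as a single function, with the embedding, retrieval, and generation stages injected as callables (the function and parameter names here are hypothetical, not from the repo):

```python
def answer(query, embed, retrieve, generate):
    """End-to-end hybrid RAG flow: embed -> hybrid retrieval -> grounded generation."""
    vector = embed(query)           # step 2: 384-d MiniLM embedding
    docs = retrieve(query, vector)  # steps 3-4: BM25 + kNN hybrid search
    context = "\n".join(docs)       # step 5: inject retrieved context
    # step 6: the LLM only ever sees the query together with real context
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

# Toy stand-ins to show the data flow; the real versions use MiniLM,
# Elasticsearch, and Gemini as described in the sections below.
out = answer(
    "What is HNSW?",
    embed=lambda q: [0.0] * 384,
    retrieve=lambda q, v: ["HNSW is a graph-based ANN index."],
    generate=lambda prompt: prompt.splitlines()[1],
)
```

The point of the injection style is that each stage can be swapped or mocked independently, which also makes the pipeline easy to test.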


4. Deploying Elasticsearch on Elastic Cloud

Instead of running Elasticsearch locally, I deployed it on Elastic Cloud to simulate a production-ready environment.

[Image: Elastic Cloud deployment dashboard showing a healthy Elasticsearch cluster]

Why Elastic Cloud?

  • Managed cluster
  • Built-in security
  • Automatic scaling
  • Production-grade infrastructure
  • Native vector search support

This ensures the architecture reflects real-world deployment patterns, not just a demo setup.


5. Configuring Elasticsearch as a Vector Database

To enable vector search, I created an index with a dense_vector field:

PUT chat_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

[Image: Elasticsearch index mapping displaying the dense_vector field with 384 dimensions and cosine similarity]

Key points:

  • dims: 384 matches MiniLM embedding output.
  • similarity: cosine aligns with semantic embedding comparison.
  • index: true enables HNSW approximate nearest neighbor search.

This transforms Elasticsearch into a scalable vector database.
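With the mapping in place, documents are indexed with both their text and their embedding. The helper below is a minimal sketch (not the repo's code) of building bulk actions in the shape `elasticsearch.helpers.bulk` expects; the stand-in embedding function would be MiniLM in the real pipeline:

```python
def build_bulk_actions(docs, embed):
    """Pair each text with its embedding, shaped for elasticsearch.helpers.bulk."""
    return [
        {"_index": "chat_index", "_source": {"content": doc, "embedding": embed(doc)}}
        for doc in docs
    ]

# Stand-in embedding for illustration; the real pipeline uses
# SentenceTransformer("all-MiniLM-L6-v2").encode(doc).tolist()
actions = build_bulk_actions(
    ["Elasticsearch supports hybrid search.", "HNSW makes kNN fast at scale."],
    embed=lambda text: [0.0] * 384,
)
```

These actions would then be sent with `elasticsearch.helpers.bulk(es, actions)` against the Elastic Cloud client.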


6. Understanding HNSW and Why It Matters

One challenge in vector search is performance.

Naively comparing a query vector against every stored vector is computationally expensive.

Elasticsearch solves this using HNSW (Hierarchical Navigable Small World graphs).

In simple terms:

  • Vectors are organized into graph layers.
  • Search navigates these layers efficiently.
  • Retrieval becomes fast even at scale.

This makes hybrid search practical for large datasets and production systems.
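Elasticsearch also exposes the HNSW graph parameters directly on the dense_vector field via index_options, so you can trade index-build cost against recall. A sketch of a tuned mapping (the m and ef_construction values shown are illustrative, not recommendations):

```
PUT chat_index_tuned
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}
```

Higher m and ef_construction generally improve recall at the cost of slower indexing and more memory; the defaults are a reasonable starting point for most workloads.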


7. Implementing Hybrid Search

Here is the hybrid query I used:

GET chat_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "Elasticsearch search"
        }
      }
    }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [ ... 384 values ... ],
    "k": 3,
    "num_candidates": 10
  }
}

[Image: Elasticsearch Dev Tools console showing a hybrid search query combining match and knn]

When a top-level query and knn section are combined like this, each hit's _score is the sum of its BM25 and kNN scores, so it reflects both lexical and semantic relevance.

In my testing, hybrid retrieval produced more balanced and reliable results compared to vector-only search.

During experimentation, I also observed how tuning k and num_candidates impacted performance. Increasing num_candidates improved recall by exploring more potential nearest neighbors, but slightly increased latency. Similarly, raising k provided broader context for generation, but too many retrieved documents sometimes diluted answer precision. For small datasets this tradeoff is minimal, but at production scale, ANN parameter tuning becomes critical for balancing speed and retrieval quality.
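To make that tuning easy to sweep, it helps to factor the search arguments into a small helper so k and num_candidates become explicit knobs. This helper is a hypothetical refactor of the query above, not code from the repo:

```python
def hybrid_search_params(query, query_vector, k=3, num_candidates=10):
    """Build the query/knn keyword arguments for es.search with tunable ANN knobs."""
    return {
        "query": {"bool": {"must": {"match": {"content": query}}}},
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,                            # hits returned by the kNN stage
            "num_candidates": num_candidates,  # HNSW candidates explored per shard
        },
    }

# Widen the candidate pool for better recall at slightly higher latency
params = hybrid_search_params("Elasticsearch search", [0.0] * 384, k=5, num_candidates=50)
```

The call site then becomes `es.search(index="chat_index", **params)`, and a benchmark loop only has to vary two integers.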


8. Building the RAG Pipeline in Python

The full implementation is available here:

GitHub Repository:

https://github.com/ANSUJKMEHER/RagChat

Step 1: Generate Embeddings

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 outputs 384-dimensional embeddings, matching the index mapping
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How does Elasticsearch improve search?"
query_embedding = model.encode(query).tolist()

Step 2: Hybrid Retrieval

# "es" is an Elasticsearch client connected to the Elastic Cloud deployment
response = es.search(
    index="chat_index",
    query={
        "bool": {
            "must": {
                "match": {"content": query}
            }
        }
    },
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 3,
        "num_candidates": 10
    }
)

[Image: Terminal output showing ranked hybrid search results with relevance scores]


Step 3: Construct Context

retrieved_docs = [hit["_source"]["content"] for hit in response["hits"]["hits"]]
context = "\n".join(retrieved_docs)

This context is passed into the LLM prompt, ensuring grounded generation.
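The exact prompt template lives in the repo; as a minimal sketch, the retrieved context and the user question can be combined into the final_prompt used in the next step like this (wording is illustrative):

```python
def build_prompt(context: str, question: str) -> str:
    """Combine retrieved context and the user question into a grounded prompt."""
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

final_prompt = build_prompt(
    context="Elasticsearch supports hybrid search combining BM25 and kNN.",
    question="How does Elasticsearch improve search?",
)
```

Instructing the model to admit when the context is insufficient is a cheap but effective guard against falling back to parametric (and possibly hallucinated) knowledge.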


9. Integrating Gemini 2.5 Flash

import requests
import os

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

url = f"https://generativelanguage.googleapis.com/v1/models/gemini-2.5-flash:generateContent?key={GEMINI_API_KEY}"

# final_prompt combines the retrieved Elasticsearch context with the user's question
payload = {
    "contents": [
        {"parts": [{"text": final_prompt}]}
    ]
}

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
result = response.json()

[Image: Terminal output displaying the final LLM-generated answer grounded in Elasticsearch retrieval]

The final answer now reflects retrieved Elastic documents rather than hallucinated content.
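The generateContent response nests the generated text under candidates → content → parts. A defensive extraction sketch (the helper name is mine; the response shape mirrors the Gemini REST API, with illustrative values):

```python
def extract_answer(result: dict) -> str:
    """Pull the generated text out of a generateContent response payload."""
    try:
        return result["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError):
        # Safety blocks or empty candidates produce payloads without this path
        return "(no answer returned)"

# Illustrative payload in the generateContent response shape
sample = {"candidates": [{"content": {"parts": [{"text": "Grounded answer."}]}}]}
```

Guarding the lookup matters in practice: blocked or empty responses omit parts of this structure, and a bare chained index raises instead of degrading gracefully.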


10. Sample Output

Example Query:

How does Elasticsearch improve search?

Example Output:

  • Supports hybrid search combining BM25 and vector similarity
  • Enables semantic similarity using dense vector embeddings
  • Scales efficiently using distributed architecture

11. Practical Applications

This architecture applies directly to:

  • Enterprise knowledge assistants
  • AI customer support bots
  • E-commerce semantic search
  • Log analysis in observability platforms
  • AI copilots grounded in proprietary data

For example, in an enterprise knowledge base, hybrid search prevents LLMs from fabricating internal policy details by forcing responses to rely strictly on indexed documentation. This significantly reduces hallucination risk while maintaining conversational fluency.

Hybrid search ensures meaning and precision coexist.


12. Key Observations from Building This

While building and testing this system, I observed:

  • Hybrid retrieval significantly improves answer grounding.
  • Retrieval quality impacts LLM output more than generation parameters.
  • Elastic Cloud simplifies scaling concerns.
  • Even small improvements in retrieval ranking dramatically improve answer quality.

One important realization:

In RAG systems, retrieval matters more than generation.


Conclusion + Takeaways

Vectorized thinking is not about replacing keyword search.

It is about enhancing it.

By combining:

  • Dense vector indexing
  • Hybrid search
  • HNSW-based ANN
  • Elastic Cloud deployment
  • RAG architecture

We create AI systems that are:

  • Reliable
  • Scalable
  • Context-aware
  • Production-ready

What surprised me most during this implementation was how much retrieval quality shapes generation quality. Adjusting embedding strategy and hybrid scoring had a larger impact on answer correctness than tweaking LLM temperature or prompt structure. This reinforced an important lesson: in production RAG systems, search engineering is not optional — it is foundational.

Elasticsearch demonstrates that search and vectors do not compete — they complement each other.

This project is a small step toward building grounded, trustworthy AI systems powered by Elastic.


GitHub Repository

Full source code available at:

https://github.com/ANSUJKMEHER/RagChat

