Author Introduction
Hi, I’m Ansuj Kumar Meher, a developer deeply interested in search systems, distributed architecture, and AI-driven applications. Over the past few months, I’ve been exploring how vector search can transform traditional retrieval systems. For the Elastic Blogathon 2026, I wanted to move beyond theory and build something practical — a fully working Hybrid RAG assistant powered by Elasticsearch on Elastic Cloud.
This blog documents that journey — including architecture decisions, implementation details, and real observations from building the system.
Abstract
Large language models are impressive, but without grounding, they hallucinate. In this blog, I demonstrate how to build a Hybrid Retrieval-Augmented Generation (RAG) system using Elasticsearch as a vector database on Elastic Cloud. By combining BM25 keyword search, HNSW-based vector similarity, and Gemini 2.5 Flash for generation, we create a scalable and production-ready semantic search assistant.
1. The Problem: LLMs Without Grounding
When I first built a chatbot using an LLM, it felt magical. It could answer questions fluently and summarize information beautifully.
But then I asked it something outside its knowledge scope.
It still answered — confidently — and incorrectly.
That’s when I realized:
LLMs are great language generators.
They are not reliable retrieval systems.
If we want trustworthy AI systems, they must retrieve real information before generating answers.
This is where Retrieval-Augmented Generation (RAG) becomes essential.
2. Why Hybrid Search Instead of Just Vector Search?
While building this project, I tested three approaches:
- Keyword-only search (BM25)
- Vector-only search
- Hybrid search (BM25 + vector similarity)
Each had strengths and weaknesses.
Keyword Search (BM25)
- Excellent at lexical precision
- Struggles with semantic intent
Vector Search
- Understands meaning
- Sometimes ignores important keywords
Hybrid Search
- Balances precision and semantic understanding
- Produces more stable ranking
For RAG systems, retrieval quality determines generation quality. Hybrid search consistently produced better grounding.
That’s why I built the system using Elasticsearch’s hybrid search capabilities.
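To build intuition for why fusing both signals helps, here is a small pure-Python sketch of score fusion. This is an illustration only, not Elasticsearch's internal scoring: it min-max normalizes a lexical score and a semantic score per document and blends them with a weight. Notice that a document that is merely decent on both signals can outrank one that dominates a single signal.

```python
# Illustrative sketch only -- Elasticsearch computes real BM25 and vector
# scores internally. This shows why fusing both signals reranks documents
# more robustly than either signal alone.

def normalize(scores):
    """Min-max normalize a dict of doc -> score into the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores match
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha balances lexical vs semantic."""
    b, v = normalize(bm25_scores), normalize(vector_scores)
    combined = {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
    return sorted(combined, key=combined.get, reverse=True)

bm25 = {"doc_a": 9.1, "doc_b": 2.3, "doc_c": 0.4}
vectors = {"doc_a": 0.42, "doc_b": 0.88, "doc_c": 0.87}
# doc_b ranks first: moderate lexical match plus strong semantic match
print(hybrid_rank(bm25, vectors))  # -> ['doc_b', 'doc_a', 'doc_c']
```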
3. System Architecture
The flow is simple but powerful:
- User submits a query.
- Query is converted into a 384-dimensional embedding using MiniLM.
- Elasticsearch performs hybrid retrieval:
- BM25 keyword match
- kNN vector search using HNSW + cosine similarity
- Top documents are retrieved.
- Retrieved context is injected into the LLM prompt.
- Gemini generates a grounded response.
The key idea is that the LLM never answers without context.
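The flow above can be sketched as a minimal pipeline skeleton. The three components are stubs standing in for MiniLM, Elasticsearch, and Gemini; the point is the wiring, especially the guard that refuses to answer without retrieved context.

```python
# Minimal RAG pipeline skeleton mirroring the flow above.
# embed/retrieve/generate are placeholders for MiniLM, Elasticsearch, and Gemini.

def embed(query: str) -> list[float]:
    # Placeholder: real code calls a sentence-transformer model.
    return [float(ord(c)) for c in query[:4]]

def retrieve(query: str, vector: list[float]) -> list[str]:
    # Placeholder: real code runs a hybrid BM25 + kNN search in Elasticsearch.
    corpus = {"elasticsearch": "Elasticsearch supports hybrid search."}
    return [text for key, text in corpus.items() if key in query.lower()]

def generate(prompt: str) -> str:
    # Placeholder: real code calls the LLM API.
    return f"Answer based on: {prompt}"

def answer(query: str) -> str:
    vector = embed(query)
    docs = retrieve(query, vector)
    if not docs:
        # The key invariant: the LLM never answers without context.
        return "I don't have enough context to answer that."
    context = "\n".join(docs)
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```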
4. Deploying Elasticsearch on Elastic Cloud
Instead of running Elasticsearch locally, I deployed it on Elastic Cloud to simulate a production-ready environment.
Why Elastic Cloud?
- Managed cluster
- Built-in security
- Automatic scaling
- Production-grade infrastructure
- Native vector search support
This ensures the architecture reflects real-world deployment patterns, not just a demo setup.
5. Configuring Elasticsearch as a Vector Database
To enable vector search, I created an index with a dense_vector field:
PUT chat_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
Key points:
- dims: 384 matches the MiniLM embedding output.
- similarity: cosine aligns with how semantic embeddings are compared.
- index: true enables HNSW approximate nearest neighbor search.
This transforms Elasticsearch into a scalable vector database.
6. Understanding HNSW and Why It Matters
One challenge in vector search is performance.
Naively comparing a query vector against every stored vector is computationally expensive.
Elasticsearch solves this using HNSW (Hierarchical Navigable Small World graphs).
In simple terms:
- Vectors are organized into graph layers.
- Search navigates these layers efficiently.
- Retrieval becomes fast even at scale.
This makes hybrid search practical for large datasets and production systems.
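To see what HNSW is accelerating, here is exact nearest-neighbor search in plain Python. This brute-force scan computes cosine similarity against every stored vector, which is O(n) per query; HNSW's graph layers approximate the same top-k result while visiting only a small fraction of the vectors. This is a conceptual sketch, not Elasticsearch's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_knn(query, vectors, k=2):
    """Compare the query against EVERY stored vector -- the cost HNSW avoids."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(exact_knn([1.0, 0.05, 0.0], store))  # -> ['doc_a', 'doc_b']
```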
7. Implementing Hybrid Search
Here is the hybrid query I used:
GET chat_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "Elasticsearch search"
        }
      }
    }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [ ... 384 values ... ],
    "k": 3,
    "num_candidates": 10
  }
}
The _score reflects combined lexical and semantic relevance.
In my testing, hybrid retrieval produced more balanced and reliable results compared to vector-only search.
During experimentation, I also observed how tuning k and num_candidates impacted performance. Increasing num_candidates improved recall by exploring more potential nearest neighbors, but slightly increased latency. Similarly, raising k provided broader context for generation, but too many retrieved documents sometimes diluted answer precision. For small datasets this tradeoff is minimal, but at production scale, ANN parameter tuning becomes critical for balancing speed and retrieval quality.
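A related fusion strategy worth knowing is Reciprocal Rank Fusion (RRF), which recent Elasticsearch versions expose natively. RRF combines rankings rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. A pure-Python sketch of the idea:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    per document. Fusing ranks avoids comparing BM25 scores to cosine
    similarities directly; k=60 is the commonly used default constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
knn_ranking = ["doc_b", "doc_c", "doc_a"]
# doc_b places highly in both lists, so it fuses to the top
print(rrf([bm25_ranking, knn_ranking]))  # -> ['doc_b', 'doc_a', 'doc_c']
```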
8. Building the RAG Pipeline in Python
The full implementation is available here:
GitHub Repository:
https://github.com/ANSUJKMEHER/RagChat
Step 1: Generate Embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode(query).tolist()
Step 2: Hybrid Retrieval
# "es" is an Elasticsearch client connected to the Elastic Cloud deployment
# (client setup lives in the linked repository)
response = es.search(
    index="chat_index",
    query={
        "bool": {
            "must": {
                "match": {"content": query}
            }
        }
    },
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 3,
        "num_candidates": 10
    }
)
Step 3: Construct Context
retrieved_docs = [hit["_source"]["content"] for hit in response["hits"]["hits"]]
context = "\n".join(retrieved_docs)
This context is passed into the LLM prompt, ensuring grounded generation.
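The prompt itself can be built with a simple template. The wording below is my own illustration rather than a prescribed format; the important part is explicitly instructing the model to rely only on the retrieved context.

```python
def build_prompt(context: str, query: str) -> str:
    """Wrap retrieved context and the user query into a grounded prompt.
    Telling the model to use ONLY the context reduces hallucination."""
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

final_prompt = build_prompt(
    "Elasticsearch supports hybrid search.",
    "How does Elasticsearch improve search?",
)
```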
9. Integrating Gemini 2.5 Flash
import requests
import os
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
url = f"https://generativelanguage.googleapis.com/v1/models/gemini-2.5-flash:generateContent?key={GEMINI_API_KEY}"
payload = {
    "contents": [
        {"parts": [{"text": final_prompt}]}
    ]
}
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()  # surface HTTP errors instead of parsing a failure body
result = response.json()
The final answer now reflects retrieved Elastic documents rather than hallucinated content.
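Extracting the generated text means walking the nested generateContent response structure (candidates, content, parts). A defensive helper avoids crashing when a response is blocked or malformed; the mocked dict below mirrors the API's nesting so the logic can be shown without a live call.

```python
def extract_text(result: dict) -> str:
    """Pull the generated text out of a generateContent response.
    Falls back to an empty string if the response is blocked or malformed."""
    try:
        return result["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError):
        return ""

# Mocked response with the same nesting as the real API payload.
mock = {"candidates": [{"content": {"parts": [{"text": "Grounded answer."}]}}]}
print(extract_text(mock))  # -> Grounded answer.
print(extract_text({}))    # -> (empty string)
```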
10. Sample Output
Example Query:
How does Elasticsearch improve search?
Example Output:
- Supports hybrid search combining BM25 and vector similarity
- Enables semantic similarity using dense vector embeddings
- Scales efficiently using distributed architecture
11. Practical Applications
This architecture applies directly to:
- Enterprise knowledge assistants
- AI customer support bots
- E-commerce semantic search
- Log analysis in observability platforms
- AI copilots grounded in proprietary data
For example, in an enterprise knowledge base, hybrid search prevents LLMs from fabricating internal policy details by forcing responses to rely strictly on indexed documentation. This significantly reduces hallucination risk while maintaining conversational fluency.
Hybrid search ensures meaning and precision coexist.
12. Key Observations from Building This
While building and testing this system, I observed:
- Hybrid retrieval significantly improves answer grounding.
- Retrieval quality impacts LLM output more than generation parameters.
- Elastic Cloud simplifies scaling concerns.
- Even small improvements in retrieval ranking dramatically improve answer quality.
One important realization:
In RAG systems, retrieval matters more than generation.
Conclusion + Takeaways
Vectorized thinking is not about replacing keyword search.
It is about enhancing it.
By combining:
- Dense vector indexing
- Hybrid search
- HNSW-based ANN
- Elastic Cloud deployment
- RAG architecture
We create AI systems that are:
- Reliable
- Scalable
- Context-aware
- Production-ready
What surprised me most during this implementation was how much retrieval quality shapes generation quality. Adjusting embedding strategy and hybrid scoring had a larger impact on answer correctness than tweaking LLM temperature or prompt structure. This reinforced an important lesson: in production RAG systems, search engineering is not optional — it is foundational.
Elasticsearch demonstrates that search and vectors do not compete — they complement each other.
This project is a small step toward building grounded, trustworthy AI systems powered by Elastic.
GitHub Repository
Full source code available at:
https://github.com/ANSUJKMEHER/RagChat