Author Introduction
Hi, I’m Ansuj Kumar Meher, a developer deeply interested in search systems, distributed architecture, and AI-driven applications. Over the past few months, I’ve been exploring how vector search can transform traditional retrieval systems. For the Elastic Blogathon 2026, I wanted to move beyond theory and build something practical — a fully working Hybrid RAG assistant powered by Elasticsearch on Elastic Cloud.
This blog documents that journey — including architecture decisions, implementation details, and real observations from building the system.
Abstract
Large language models are impressive, but without grounding, they hallucinate. In this blog, I demonstrate how to build a Hybrid Retrieval-Augmented Generation (RAG) system using Elasticsearch as a vector database on Elastic Cloud. By combining BM25 keyword search, HNSW-based vector similarity, and Gemini 2.5 Flash for generation, we create a scalable and production-ready semantic search assistant.
1. The Problem: LLMs Without Grounding
When I first built a chatbot using an LLM, it felt magical. It could answer questions fluently and summarize information beautifully.
But then I asked it something outside its knowledge scope.
It still answered — confidently — and incorrectly.
That’s when I realized:
LLMs are great language generators.
They are not reliable retrieval systems.
If we want trustworthy AI systems, they must retrieve real information before generating answers.
This is where Retrieval-Augmented Generation (RAG) becomes essential.
2. Why Hybrid Search Instead of Just Vector Search?
While building this project, I tested three approaches:
- Keyword-only search (BM25)
- Vector-only search
- Hybrid search (BM25 + vector similarity)
Each had strengths and weaknesses.
Keyword Search (BM25)
- Excellent at lexical precision
- Struggles with semantic intent
Vector Search
- Understands meaning
- Sometimes ignores important keywords
Hybrid Search
- Balances precision and semantic understanding
- Produces more stable ranking
For RAG systems, retrieval quality determines generation quality. Hybrid search consistently produced better grounding.
That’s why I built the system using Elasticsearch’s hybrid search capabilities.
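To build intuition for why fusing both signals helps, here is a small pure-Python sketch of score fusion. This is an illustration only, not Elasticsearch's internal scoring: it min-max normalizes a lexical score and a semantic score per document and blends them with a weight. Notice that a document that is merely decent on both signals can outrank one that dominates a single signal.

```python
# Illustrative sketch only -- Elasticsearch computes real BM25 and vector
# scores internally. This shows why fusing both signals reranks documents
# more robustly than either signal alone.

def normalize(scores):
    """Min-max normalize a dict of doc -> score into the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores match
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha balances lexical vs semantic."""
    b, v = normalize(bm25_scores), normalize(vector_scores)
    combined = {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
    return sorted(combined, key=combined.get, reverse=True)

bm25 = {"doc_a": 9.1, "doc_b": 2.3, "doc_c": 0.4}
vectors = {"doc_a": 0.42, "doc_b": 0.88, "doc_c": 0.87}
# doc_b ranks first: moderate lexical match plus strong semantic match
print(hybrid_rank(bm25, vectors))  # -> ['doc_b', 'doc_a', 'doc_c']
```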
3. System Architecture
The flow is simple but powerful:
- User submits a query.
- Query is converted into a 384-dimensional embedding using MiniLM.
- Elasticsearch performs hybrid retrieval:
- BM25 keyword match
- kNN vector search using HNSW + cosine similarity
- Top documents are retrieved.
- Retrieved context is injected into the LLM prompt.
- Gemini generates a grounded response.
The key idea is that the LLM never answers without context.
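The flow above can be sketched as a minimal pipeline skeleton. The three components are stubs standing in for MiniLM, Elasticsearch, and Gemini; the point is the wiring, especially the guard that refuses to answer without retrieved context.

```python
# Minimal RAG pipeline skeleton mirroring the flow above.
# embed/retrieve/generate are placeholders for MiniLM, Elasticsearch, and Gemini.

def embed(query: str) -> list[float]:
    # Placeholder: real code calls a sentence-transformer model.
    return [float(ord(c)) for c in query[:4]]

def retrieve(query: str, vector: list[float]) -> list[str]:
    # Placeholder: real code runs a hybrid BM25 + kNN search in Elasticsearch.
    corpus = {"elasticsearch": "Elasticsearch supports hybrid search."}
    return [text for key, text in corpus.items() if key in query.lower()]

def generate(prompt: str) -> str:
    # Placeholder: real code calls the LLM API.
    return f"Answer based on: {prompt}"

def answer(query: str) -> str:
    vector = embed(query)
    docs = retrieve(query, vector)
    if not docs:
        # The key invariant: the LLM never answers without context.
        return "I don't have enough context to answer that."
    context = "\n".join(docs)
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```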
4. Deploying Elasticsearch on Elastic Cloud
Instead of running Elasticsearch locally, I deployed it on Elastic Cloud to simulate a production-ready environment.
Why Elastic Cloud?
- Managed cluster
- Built-in security
- Automatic scaling
- Production-grade infrastructure
- Native vector search support
This ensures the architecture reflects real-world deployment patterns, not just a demo setup.
5. Configuring Elasticsearch as a Vector Database
To enable vector search, I created an index with a dense_vector field:
PUT chat_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
Key points:
- dims: 384 matches the MiniLM embedding output.
- similarity: cosine aligns with how semantic embeddings are compared.
- index: true enables HNSW approximate nearest neighbor search.
This transforms Elasticsearch into a scalable vector database.
6. Understanding HNSW and Why It Matters
One challenge in vector search is performance.
Naively comparing a query vector against every stored vector is computationally expensive.
Elasticsearch solves this using HNSW (Hierarchical Navigable Small World graphs).
In simple terms:
- Vectors are organized into graph layers.
- Search navigates these layers efficiently.
- Retrieval becomes fast even at scale.
This makes hybrid search practical for large datasets and production systems.
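To see what HNSW is accelerating, here is exact nearest-neighbor search in plain Python. This brute-force scan computes cosine similarity against every stored vector, which is O(n) per query; HNSW's graph layers approximate the same top-k result while visiting only a small fraction of the vectors. This is a conceptual sketch, not Elasticsearch's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_knn(query, vectors, k=2):
    """Compare the query against EVERY stored vector -- the cost HNSW avoids."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(exact_knn([1.0, 0.05, 0.0], store))  # -> ['doc_a', 'doc_b']
```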
7. Implementing Hybrid Search
Here is the hybrid query I used:
GET chat_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "Elasticsearch search"
        }
      }
    }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [ ... 384 values ... ],
    "k": 3,
    "num_candidates": 10
  }
}
The _score reflects combined lexical and semantic relevance.
In my testing, hybrid retrieval produced more balanced and reliable results compared to vector-only search.
During experimentation, I also observed how tuning k and num_candidates impacted performance. Increasing num_candidates improved recall by exploring more potential nearest neighbors, but slightly increased latency. Similarly, raising k provided broader context for generation, but too many retrieved documents sometimes diluted answer precision. For small datasets this tradeoff is minimal, but at production scale, ANN parameter tuning becomes critical for balancing speed and retrieval quality.
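A related fusion strategy worth knowing is Reciprocal Rank Fusion (RRF), which recent Elasticsearch versions expose natively. RRF combines rankings rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. A pure-Python sketch of the idea:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    per document. Fusing ranks avoids comparing BM25 scores to cosine
    similarities directly; k=60 is the commonly used default constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
knn_ranking = ["doc_b", "doc_c", "doc_a"]
# doc_b places highly in both lists, so it fuses to the top
print(rrf([bm25_ranking, knn_ranking]))  # -> ['doc_b', 'doc_a', 'doc_c']
```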
8. Building the RAG Pipeline in Python
The full implementation is available here:
GitHub Repository:
https://github.com/ANSUJKMEHER/RagChat
Step 1: Generate Embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode(query).tolist()
Step 2: Hybrid Retrieval
# "es" is an Elasticsearch client connected to the Elastic Cloud deployment
# (client setup lives in the linked repository)
response = es.search(
    index="chat_index",
    query={
        "bool": {
            "must": {
                "match": {"content": query}
            }
        }
    },
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 3,
        "num_candidates": 10
    }
)
Step 3: Construct Context
retrieved_docs = [hit["_source"]["content"] for hit in response["hits"]["hits"]]
context = "\n".join(retrieved_docs)
This context is passed into the LLM prompt, ensuring grounded generation.
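The prompt itself can be built with a simple template. The wording below is my own illustration rather than a prescribed format; the important part is explicitly instructing the model to rely only on the retrieved context.

```python
def build_prompt(context: str, query: str) -> str:
    """Wrap retrieved context and the user query into a grounded prompt.
    Telling the model to use ONLY the context reduces hallucination."""
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

final_prompt = build_prompt(
    "Elasticsearch supports hybrid search.",
    "How does Elasticsearch improve search?",
)
```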
9. Integrating Gemini 2.5 Flash
import requests
import os
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
url = f"https://generativelanguage.googleapis.com/v1/models/gemini-2.5-flash:generateContent?key={GEMINI_API_KEY}"
payload = {
    "contents": [
        {"parts": [{"text": final_prompt}]}
    ]
}
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()  # surface HTTP errors instead of parsing a failure body
result = response.json()
The final answer now reflects retrieved Elastic documents rather than hallucinated content.
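Extracting the generated text means walking the nested generateContent response structure (candidates, content, parts). A defensive helper avoids crashing when a response is blocked or malformed; the mocked dict below mirrors the API's nesting so the logic can be shown without a live call.

```python
def extract_text(result: dict) -> str:
    """Pull the generated text out of a generateContent response.
    Falls back to an empty string if the response is blocked or malformed."""
    try:
        return result["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError):
        return ""

# Mocked response with the same nesting as the real API payload.
mock = {"candidates": [{"content": {"parts": [{"text": "Grounded answer."}]}}]}
print(extract_text(mock))  # -> Grounded answer.
print(extract_text({}))    # -> (empty string)
```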
10. Sample Output
Example Query:
How does Elasticsearch improve search?
Example Output:
- Supports hybrid search combining BM25 and vector similarity
- Enables semantic similarity using dense vector embeddings
- Scales efficiently using distributed architecture
11. Practical Applications
This architecture applies directly to:
- Enterprise knowledge assistants
- AI customer support bots
- E-commerce semantic search
- Log analysis in observability platforms
- AI copilots grounded in proprietary data
For example, in an enterprise knowledge base, hybrid search prevents LLMs from fabricating internal policy details by forcing responses to rely strictly on indexed documentation. This significantly reduces hallucination risk while maintaining conversational fluency.
Hybrid search ensures meaning and precision coexist.
12. Key Observations from Building This
While building and testing this system, I observed:
- Hybrid retrieval significantly improves answer grounding.
- Retrieval quality impacts LLM output more than generation parameters.
- Elastic Cloud simplifies scaling concerns.
- Even small improvements in retrieval ranking dramatically improve answer quality.
One important realization:
In RAG systems, retrieval matters more than generation.
Conclusion + Takeaways
Vectorized thinking is not about replacing keyword search.
It is about enhancing it.
By combining:
- Dense vector indexing
- Hybrid search
- HNSW-based ANN
- Elastic Cloud deployment
- RAG architecture
We create AI systems that are:
- Reliable
- Scalable
- Context-aware
- Production-ready
What surprised me most during this implementation was how much retrieval quality shapes generation quality. Adjusting embedding strategy and hybrid scoring had a larger impact on answer correctness than tweaking LLM temperature or prompt structure. This reinforced an important lesson: in production RAG systems, search engineering is not optional — it is foundational.
Elasticsearch demonstrates that search and vectors do not compete — they complement each other.
This project is a small step toward building grounded, trustworthy AI systems powered by Elastic.
GitHub Repository
Full source code available at:
https://github.com/ANSUJKMEHER/RagChat