AUTHOR INTRO
I am Madhesh, a passionate developer with a strong interest in Agentic AI and DevOps. I enjoy learning new things, and I have always wanted to start writing blogs to connect with people. I chose to work on RAG because large language models (LLMs) are everywhere, and RAG adds significant power to them by providing proper context for user queries.
ABSTRACT
LLMs often hallucinate on domain-specific or recent data because they lack the proper context for user queries. Traditional LLM outputs rely solely on training data, which may not contain up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) overcomes these problems with strong retrieval pipelines. In this blog, I walk through designing and implementing a complete RAG pipeline using Elastic as the vector database. From ingesting documents to semantic retrieval and LLM augmentation, discover how Elastic's vector capabilities deliver accurate, hallucination-resistant AI applications.
NAIVE SEARCH (KEYWORD SEARCH)
The naive way to search for relevant content in a document or database is by using a basic keyword search.
Example - search in a file:
grep "keyword" file.txt
Example - SQL keyword search in a database:
SELECT * FROM table_name WHERE column_name LIKE '%keyword%';
Keyword search works by finding exact matches. But if the user uses different words with the same meaning, keyword search fails. That is where semantic search and vector embeddings become useful.
TF-IDF
TF-IDF is a classic method to score how important a term is in a document relative to a corpus.
- TF (Term Frequency) looks at how many times a word appears in a specific document.
- DF (Document Frequency) is the number of documents where the word appears.
- IDF (Inverse Document Frequency) measures the importance of the word across the entire document set.
TF(t, d) = number of times term t appears in document d
DF(t) = number of documents containing term t
IDF(t) = log(N / DF(t)), where N = total number of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)
TF-IDF weights terms that are frequent in a document but rare in the corpus, giving more relevant ranking than pure keyword counts.
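The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration using raw counts for TF; real libraries (e.g. scikit-learn's TfidfVectorizer) add smoothing and normalization on top:

```python
import math

# Toy corpus of three documents, tokenized by whitespace
docs = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "machine learning and AI",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    # Term frequency: raw count of the term in one document
    return doc_tokens.count(term)

def idf(term):
    # Inverse document frequency: log(N / DF(t))
    df = sum(1 for d in tokenized if term in d)
    return math.log(N / df) if df else 0.0

def tfidf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "learning" appears in all three docs, so its IDF (and TF-IDF) is 0;
# "powerful" appears in only one doc, so it scores higher there.
print(tfidf("learning", tokenized[0]))  # 1 * log(3/3) = 0.0
print(tfidf("powerful", tokenized[0]))  # 1 * log(3/1) ≈ 1.0986
```

Note how a term that appears everywhere carries no discriminating weight, which is exactly the behavior pure keyword counting lacks.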
BM25
BM25 (Best Matching 25) is a ranking algorithm used in retrieval systems to determine the relevance of documents to a given user query. It is the default ranking algorithm in systems like Elasticsearch and Whoosh. BM25 improves over TF-IDF by:
- Normalizing for document length
- Saturating term frequency (more occurrences do not increase importance linearly)
- Producing better relevance scoring in practice
Compute BM25 scores in Python with the rank_bm25 library:
from rank_bm25 import BM25Okapi

# Toy corpus, tokenized by whitespace
docs = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "machine learning and AI"
]
tokenized = [doc.split() for doc in docs]

# Build the BM25 index and score every document against the query
bm25 = BM25Okapi(tokenized)
query = "machine learning".split()
scores = bm25.get_scores(query)
print(scores)  # one relevance score per document
BM25 produces a score for each document based on the query and ranks them by relevance.
VECTOR EMBEDDINGS
When a user query uses a different word but similar meaning, keyword methods fail. This is where vector embeddings solve the problem.
Embeddings transform text into numerical vectors that capture semantic meaning. Similar texts have vectors close to each other in vector space.
Generate embeddings:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["machine learning", "deep learning"]
vectors = model.encode(texts)
print(vectors.shape) # (2, 384)
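Closeness in vector space is usually measured with cosine similarity. A minimal NumPy sketch; the 3-dimensional vectors here are made-up illustrations (real MiniLM embeddings have 384 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tiny embeddings for three texts
ml = np.array([0.9, 0.1, 0.2])       # "machine learning"
dl = np.array([0.8, 0.2, 0.3])       # "deep learning"
cooking = np.array([0.1, 0.9, 0.1])  # "cooking recipes"

print(cosine_similarity(ml, dl))       # high: related topics
print(cosine_similarity(ml, cooking))  # low: unrelated topics
```

Semantic search ranks documents by this similarity between the query vector and each stored vector, so "ML" and "machine learning" can match even without shared keywords.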
INTRO TO RAG PIPELINES
A RAG pipeline consists of several stages. Offline, documents are ingested, chunked, embedded, and indexed. Online, when a query arrives, the system retrieves the most relevant chunks, augments the prompt with that context, and the LLM generates a grounded, accurate response.
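The online path can be sketched as a small function. This is only a skeleton: embed, retrieve, and generate are placeholders for the components covered in the following sections:

```python
def rag_answer(query, index, embed, retrieve, generate, k=3):
    # Offline, documents were already chunked, embedded, and stored in `index`.
    # Online: 1) embed the incoming query
    query_vector = embed(query)
    # 2) retrieve the k most similar chunks from the vector index
    chunks = retrieve(index, query_vector, k)
    # 3) augment: build a prompt that grounds the LLM in retrieved context
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}\nAnswer:"
    # 4) generate the final answer from the augmented prompt
    return generate(prompt)
```

Each placeholder maps to a concrete piece later in this post: embedding models, Elasticsearch retrieval, and the LLM call.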
RELEVANT CONTEXT AND PREPROCESSING
First, ingest raw data into the RAG system. To make it effective, choose proper preprocessing techniques:
Chunking
Chunking breaks large documents into smaller pieces that are easier to index and retrieve. Good chunking balances context with retrieval efficiency.
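A minimal fixed-size chunker with overlap, character-based for simplicity (production systems often split on sentences or tokens instead):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap
    # so adjacent chunks share some context across the boundary.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

document = "RAG pipelines ingest documents, split them into chunks, " * 10
chunks = chunk_text(document, chunk_size=100, overlap=20)
print(len(chunks), len(chunks[0]))  # number of chunks, size of the first chunk
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in at least one chunk, which is the "balance context with retrieval efficiency" trade-off in practice.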
VECTOR DATABASE
Once text is chunked and embedded into vectors, store it in a vector database (e.g., Elasticsearch). The vector DB stores embeddings and performs similarity search to match user queries with relevant chunks.
ELASTICSEARCH – SETUP & CODE
1. Create index with vector field
curl -X PUT "localhost:9200/docs" -H "Content-Type: application/json" -d '
{
"mappings": {
"properties": {
"text": { "type": "text" },
"vector": { "type": "dense_vector", "dims": 384, "index": true, "similarity": "cosine" }
}
}
}'
2. Insert document with embedding
curl -X POST "localhost:9200/docs/_doc" -H "Content-Type: application/json" -d '
{
"text": "machine learning is powerful",
"vector": [0.12, -0.93, ...]
}'
(Replace the placeholder array with the real 384-dimensional embedding vector; JSON does not allow inline comments.)
3. Query using BM25 (keyword search)
curl -X GET "localhost:9200/docs/_search" -H "Content-Type: application/json" -d '
{
"query": {
"match": {
"text": "machine learning"
}
}
}'
4. Query using Vector Similarity
curl -X GET "localhost:9200/docs/_search" -H "Content-Type: application/json" -d '
{
"knn": {
"field": "vector",
"query_vector": [0.12, -0.93, ...],
"k": 3,
"num_candidates": 10
}
}'
5. Hybrid Search (BM25 + Vector)
curl -X GET "localhost:9200/docs/_search" -H "Content-Type: application/json" -d '
{
"query": {
"bool": {
"should": [
{ "match": { "text": "machine learning" }},
{
"knn": {
"field": "vector",
"query_vector": [0.12, -0.93, ...],
"k": 3,
"num_candidates": 10
}
}
]
}
}
}'
Hybrid search combines keyword ranking (BM25) and semantic ranking (vector similarity).
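The same hybrid request can be assembled in Python as a plain dict and sent with the official elasticsearch client (shown in a comment; the two-element vector is a truncated placeholder for the real 384-dimensional embedding):

```python
import json

def hybrid_query(text_query, query_vector, k=3, num_candidates=10):
    # BM25 match clause and kNN clause combined under bool/should,
    # mirroring the curl example above.
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": text_query}},
                    {
                        "knn": {
                            "field": "vector",
                            "query_vector": query_vector,
                            "k": k,
                            "num_candidates": num_candidates,
                        }
                    },
                ]
            }
        }
    }

body = hybrid_query("machine learning", [0.12, -0.93])  # placeholder vector
print(json.dumps(body, indent=2))
# To execute against a running cluster:
# Elasticsearch("http://localhost:9200").search(index="docs", **body)
```

Building the body in code makes it easy to swap in the real query embedding at request time.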
RERANKING
Reranking is a post-processing step that improves result relevance by applying stronger scoring methods. It considers semantic relevance and similarity to reorder results for better quality. Reranking is more computationally expensive and is usually applied only to top results.
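A sketch of the rerank step. The word-overlap scorer here is a toy stand-in; in practice a stronger model (e.g. a sentence-transformers CrossEncoder) scores each query-document pair:

```python
def rerank(query, candidates, score_fn, top_n=3):
    # Re-score only the retrieved candidates (not the whole corpus)
    # and return them ordered by the stronger score.
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query, doc):
    # Toy scorer: fraction of query words present in the document
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

candidates = [
    "deep learning uses neural networks",
    "machine learning is powerful",
    "cooking with machine oil",
]
print(rerank("machine learning", candidates, overlap_score, top_n=2))
```

Because only the top retrieved candidates are re-scored, the expensive model runs on a handful of documents rather than the entire index.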
INTEGRATING ELASTIC WITH LLMS
Elastic can serve as the retrieval backend for a RAG system. When a user query arrives:
- The query is embedded (converted to a vector).
- Elastic retrieves the most similar chunks (vector search).
- The retrieved chunks are passed to the LLM.
- The LLM generates an answer grounded in retrieved context.
This integration reduces hallucination and increases response accuracy.
PRODUCTION INSIGHT
When building a RAG pipeline, most developers focus heavily on the LLM and ignore the retrieval layer. In practice, retrieval quality matters more than model size. If the retriever returns irrelevant chunks, even the best LLM will confidently generate incorrect answers. I realized this while experimenting with chunk sizes and indexing strategies: small changes in chunking and overlap significantly changed answer quality.
Another important point is that hybrid search often performs better than pure vector search. Vector similarity is powerful for semantic understanding, but keyword signals still matter in production. In many cases, combining BM25 with vector search improved precision and reduced noise. Reranking also made a visible difference, especially when the initial retrieval returned loosely related results.
Latency is another real-world factor that is often underestimated. Running embeddings, querying vectors, reranking, and then calling an LLM adds up quickly. In production systems, you must balance accuracy with response time. Tuning the top-K retrieval size, embedding model selection, and reranking depth directly impacts both performance and cost.
Finally, data freshness matters. RAG systems must support continuous indexing. If documents are not updated properly, the system becomes stale and starts returning outdated context. In production, retrieval pipelines must be monitored just like any other backend service.
DEPLOY RAG MODELS ON CLOUD
Elastic Cloud provides a fully managed Elasticsearch environment with built-in scaling, security, and monitoring. Instead of managing nodes, shard allocation, replication, and cluster health manually, Elastic Cloud handles infrastructure operations. This allows developers to focus on indexing documents, embedding pipelines, hybrid retrieval, and LLM integration rather than maintaining search infrastructure.
For a RAG pipeline, Elastic Cloud supports:
- Dense vector fields for storing embeddings
- kNN vector search for semantic retrieval
- BM25-based keyword search
- Hybrid search combining lexical and semantic signals
- Secure deployment with role-based access control
A production-ready RAG architecture on Elastic Cloud typically includes:
- An embedding model (self-hosted or API-based)
- An Elastic Cloud deployment with vector-enabled indices
- A backend service that performs retrieval and prompt construction
- An LLM provider for generation
- Monitoring via Elastic Stack (metrics, logs, performance tracking)
As embeddings scale into millions of vectors, cluster sizing becomes critical. Elastic Cloud allows vertical and horizontal scaling by adjusting node size and instance count without downtime. This is essential when handling increasing search traffic or expanding document collections.
Security is also a major factor. Elastic Cloud provides TLS encryption, API keys, and access controls out of the box. In AI applications dealing with private documents or enterprise data, this becomes non-negotiable.
In real-world systems, RAG is not only about retrieval and generation quality. It is about cluster stability, index performance, scaling strategy, and operational visibility. Elastic Cloud provides the infrastructure layer that makes large-scale RAG systems stable, secure, and production-ready.
CONCLUSION
It is easy to over-engineer these systems. The true value of RAG lies in strengthening LLM responses with real context from scalable systems like Elasticsearch. RAG makes LLMs less prone to hallucination and vastly improves relevance and accuracy.
If answer quality is poor, debug the pipeline in order: improve retrieval (step 1) first, because generation (step 2) can only be as good as the context it receives, and then tune the generation step.
Project Repository:
GitHub on RAG
Note: The content of this blog is fully organic. AI was utilized solely for grammatical error correction and structural alignment.