<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sai Varnik </title>
    <description>The latest articles on DEV Community by Sai Varnik  (@saivarnik).</description>
    <link>https://dev.to/saivarnik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3798115%2F20394412-7459-48db-9150-5ed9ace8dd1d.jpg</url>
      <title>DEV Community: Sai Varnik </title>
      <link>https://dev.to/saivarnik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saivarnik"/>
    <language>en</language>
    <item>
      <title>Building a Simple RAG Pipeline with Elasticsearch as a Vector Database: A Practical Guide with Code, Architecture &amp; Lessons</title>
      <dc:creator>Sai Varnik </dc:creator>
      <pubDate>Sat, 28 Feb 2026 11:36:17 +0000</pubDate>
      <link>https://dev.to/saivarnik/building-a-simple-rag-pipeline-with-elasticsearch-as-a-vector-database-a-practical-guide-with-1co2</link>
      <guid>https://dev.to/saivarnik/building-a-simple-rag-pipeline-with-elasticsearch-as-a-vector-database-a-practical-guide-with-1co2</guid>
      <description>&lt;p&gt;By Software Engineer (GenAI / Search)  |  Elastic Blogathon 2026  |  Theme: Vectorized Thinking&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;br&gt;
Retrieval-Augmented Generation (RAG) is reshaping how developers build AI-powered search and question-answering systems. Elasticsearch, long the industry standard for full-text search, has evolved into a first-class vector database — offering kNN search, sparse vectors, and powerful hybrid retrieval out of the box.&lt;/p&gt;

&lt;p&gt;In this blog, I share a complete, hands-on walkthrough of building a production-grade RAG pipeline using Elasticsearch as the vector store, integrated with OpenAI embeddings and GPT-4o. You will find real code, architectural diagrams, benchmark results, and the hard-won lessons that only come from actually shipping this.&lt;/p&gt;

&lt;h2&gt;1. Why Elasticsearch for RAG?&lt;/h2&gt;

&lt;p&gt;Before we write a single line of code, it is worth asking: why choose Elasticsearch as your vector database? There are many specialized vector DBs available — Pinecone, Weaviate, Qdrant, Chroma — each with strengths. But Elasticsearch offers a unique combination that is difficult to match:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hybrid search, natively.&lt;/strong&gt; Elasticsearch supports both dense vector kNN search and BM25 full-text search simultaneously, fused through Reciprocal Rank Fusion (RRF). No external orchestration needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mature operational tooling.&lt;/strong&gt; You already have monitoring via Kibana, security via role-based access control, and battle-tested horizontal scaling — things you'd have to build yourself with newer vector-only DBs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single data plane.&lt;/strong&gt; Your structured metadata (dates, categories, authors) and your vector embeddings live together. Filtered vector search is a native query construct, not a post-processing hack.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HNSW-based ANN.&lt;/strong&gt; Elasticsearch uses Hierarchical Navigable Small World (HNSW) graphs — the gold-standard approximate nearest neighbor algorithm — giving you sub-100ms retrieval at scale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No vendor lock-in.&lt;/strong&gt; Elasticsearch is open-source at its core. You own your infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: Elasticsearch is not just a vector database — it is one of the few search engines that make production-grade hybrid RAG straightforward out of the box.&lt;/p&gt;

&lt;h2&gt;2. System Architecture&lt;/h2&gt;

&lt;p&gt;The RAG pipeline I built serves as an intelligent document question-answering system over a corpus of legal documents (contracts, NDAs, and compliance policies). Here is the high-level flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23zt2s0ouu5qh8cunqas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23zt2s0ouu5qh8cunqas.png" alt="RAG Pipeline Architecture with Elasticsearch" width="800" height="466"&gt;&lt;/a&gt;&lt;br&gt;
Every user question is first converted into a 1,536-dimensional dense vector using OpenAI's text-embedding-3-small model. That vector is sent to Elasticsearch, which runs a hybrid kNN + BM25 query to retrieve the top-k most relevant document chunks. Those chunks are injected into the LLM prompt as context, and GPT-4o generates a grounded, citation-aware answer.&lt;/p&gt;

&lt;h2&gt;3. Setting Up Elasticsearch for Vectors&lt;/h2&gt;

&lt;h3&gt;3.1 Index Mapping&lt;/h3&gt;

&lt;p&gt;The foundation of everything is the index mapping. Getting this right from the start avoids painful re-indexing later. Here is the mapping I used:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /legal-docs
{
  "mappings": {
    "properties": {
      "doc_id":    { "type": "keyword" },
      "title":     { "type": "text", "analyzer": "english" },
      "content":   { "type": "text", "analyzer": "english" },
      "category":  { "type": "keyword" },
      "created_at":{ "type": "date" },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;⚡ Key Design Decision: Why cosine similarity?&lt;br&gt;
OpenAI embeddings are already normalized, so cosine similarity and dot product produce identical rankings.&lt;br&gt;
However, cosine is more robust if you ever switch embedding providers or use un-normalized vectors.&lt;br&gt;
Set it once correctly and you will never need to re-index due to a similarity change.&lt;/p&gt;
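&lt;p&gt;The normalization claim is easy to check numerically: on unit-length vectors, dot product and cosine similarity are the same number, so they induce identical rankings. A minimal, dependency-free sketch (toy vectors, not real embeddings):&lt;/p&gt;

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy "embeddings", normalized to unit length like OpenAI's
query = normalize([1.0, 2.0, 3.0])
docs = {'a': normalize([1.0, 2.1, 2.9]), 'b': normalize([3.0, 0.5, 0.1])}

rank_by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
rank_by_cos = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
assert rank_by_dot == rank_by_cos  # identical rankings on unit vectors
```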

&lt;h3&gt;3.2 Chunking Strategy&lt;/h3&gt;

&lt;p&gt;Chunking is the most underrated decision in any RAG system. Chunk too large and the retrieval is noisy. Chunk too small and you lose context. After experimenting, I landed on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;512 tokens per chunk&lt;/strong&gt; with a 50-token overlap between adjacent chunks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sentence-boundary aware splitting&lt;/strong&gt; using NLTK — never cutting mid-sentence.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata preserved on every chunk:&lt;/strong&gt; source document ID, page number, section heading.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.text_splitter import RecursiveCharacterTextSplitter

# NOTE: chunk_size is measured with len() (characters) by default;
# pass a token-based length_function if you want true token counts
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ']
)

chunks = splitter.split_text(document_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
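&lt;p&gt;For the sentence-boundary splitting mentioned above, here is a minimal sketch of the packing logic. Two assumptions keep it self-contained: a crude regex split stands in for NLTK's sent_tokenize, and whitespace-separated words stand in for model tokens; chunk_by_sentences is a hypothetical helper, not part of any library:&lt;/p&gt;

```python
import re

def split_sentences(text):
    # Crude stand-in for nltk.sent_tokenize: split after ., ! or ?
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def chunk_by_sentences(text, max_tokens=512, overlap_sents=1):
    """Pack whole sentences into chunks, carrying trailing sentences as overlap."""
    chunks, current, count = [], [], 0
    for sent in split_sentences(text):
        n = len(sent.split())  # whitespace tokens as a stand-in for model tokens
        if current and count + n > max_tokens:
            chunks.append(' '.join(current))
            current = current[-overlap_sents:]  # overlap with the previous chunk
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

text = "One two three. Four five six. Seven eight nine."
print(chunk_by_sentences(text, max_tokens=4))
# → ['One two three.', 'One two three. Four five six.', 'Four five six. Seven eight nine.']
```

No chunk is ever cut mid-sentence, and each chunk after the first repeats the previous chunk's trailing sentence as overlap.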

&lt;h2&gt;4. Embedding and Indexing Pipeline&lt;/h2&gt;

&lt;p&gt;Efficient ingestion is critical at scale. I built a batched ingestion pipeline to stay within API rate limits and minimize wall-clock time (the fully async variant is covered under Production Optimizations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import openai
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('https://localhost:9200', api_key='YOUR_API_KEY')
client = openai.OpenAI(api_key='OPENAI_KEY')

def embed_batch(texts: list[str]) -&amp;gt; list[list[float]]:
    response = client.embeddings.create(
        model='text-embedding-3-small',
        input=texts
    )
    return [item.embedding for item in response.data]

def index_chunks(chunks: list[dict]):
    batch_size = 100
    actions = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c['content'] for c in batch]
        embeddings = embed_batch(texts)
        for chunk, emb in zip(batch, embeddings):
            actions.append({
                '_index': 'legal-docs',
                '_source': {
                    'doc_id': chunk['doc_id'],
                    'content': chunk['content'],
                    'title': chunk['title'],
                    'category': chunk['category'],
                    'embedding': emb
                }
            })
    helpers.bulk(es, actions)
    print(f'Indexed {len(actions)} chunks')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
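&lt;p&gt;On the rate-limit point: a simple way to make embed_batch resilient is an exponential-backoff wrapper. This is a sketch with a hypothetical with_backoff helper; it catches a generic Exception for brevity, whereas in practice you would catch openai.RateLimitError specifically:&lt;/p&gt;

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn with exponential backoff; re-raise after max_retries attempts."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except Exception:  # in practice: openai.RateLimitError
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return wrapper

# Usage sketch: safe_embed = with_backoff(embed_batch); safe_embed(texts)
```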

&lt;h2&gt;5. Hybrid Retrieval: The Secret Weapon&lt;/h2&gt;

&lt;p&gt;This is where Elasticsearch truly separates itself from pure vector databases. Most RAG implementations use vector-only retrieval. That is a mistake. Hybrid search — combining semantic similarity with BM25 keyword matching — consistently outperforms either approach alone.&lt;/p&gt;

&lt;h3&gt;5.1 The Hybrid Query&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def hybrid_search(query: str, category: str = None, top_k: int = 5) -&amp;gt; list[dict]:
    query_vector = embed_batch([query])[0]

    # Optional metadata filter
    filter_clause = []
    if category:
        filter_clause = [{'term': {'category': category}}]

    query_body = {
        'sub_searches': [
            # Branch 1: Dense vector kNN
            {
                'query': {
                    'knn': {
                        'field': 'embedding',
                        'query_vector': query_vector,
                        'num_candidates': 100,
                        'filter': filter_clause
                    }
                }
            },
            # Branch 2: BM25 keyword
            {
                'query': {
                    'bool': {
                        'must': [{'match': {'content': query}}],
                        'filter': filter_clause
                    }
                }
            }
        ],
        'rank': {
            'rrf': {'window_size': 50, 'rank_constant': 20}
        },
        'size': top_k,
        '_source': ['doc_id', 'title', 'content', 'category']
    }

    response = es.search(index='legal-docs', body=query_body)
    return [hit['_source'] for hit in response['hits']['hits']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;🔍 What is RRF and why does it work?&lt;br&gt;
Reciprocal Rank Fusion (RRF) merges ranked lists from multiple retrieval branches. Each document&lt;br&gt;
receives a score of 1/(rank + k) from each branch, and scores are summed. This is robust because it&lt;br&gt;
doesn't require score normalization across incompatible scales (BM25 vs. cosine similarity).&lt;/p&gt;

&lt;p&gt;rank_constant=20 controls the fusion (Elasticsearch's own default is 60): lower values give more weight to top-ranked docs, higher values&lt;br&gt;
smooth out differences. window_size=50 means the top 50 results from each branch are considered before fusion.&lt;/p&gt;
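&lt;p&gt;The fusion formula is simple enough to reproduce in a few lines of plain Python (hypothetical doc IDs; the same 1/(rank + rank_constant) scoring described above):&lt;/p&gt;

```python
from collections import defaultdict

def rrf_fuse(rankings, rank_constant=20):
    """Fuse ranked lists: each doc scores 1/(rank + k) per list, summed."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank + rank_constant)
    return sorted(scores, key=scores.get, reverse=True)

knn_branch  = ['a', 'b', 'c']   # ranked by cosine similarity
bm25_branch = ['b', 'c', 'a']   # ranked by BM25 score
print(rrf_fuse([knn_branch, bm25_branch]))
# → ['b', 'a', 'c']
```

Note that 'b' wins overall despite topping only one branch: consistent mid-to-high ranks across both branches beat a single first place, which is exactly why RRF needs no score normalization.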

&lt;h2&gt;6. RAG Response Generation&lt;/h2&gt;

&lt;p&gt;With relevant chunks retrieved, the final step is constructing a precise, grounded prompt for the LLM:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_answer(query: str, contexts: list[dict]) -&amp;gt; str:
    context_text = '\n\n'.join([
        f"[Source: {c['title']}]\n{c['content']}"
        for c in contexts
    ])

    system_prompt = """You are a legal document assistant. Answer questions
based ONLY on the provided context. If the answer is not in the context,
say 'I don't have enough information.' Always cite the source document."""

    user_prompt = f"""Context:\n{context_text}

Question: {query}

Answer (with citations):"""

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt}
        ],
        temperature=0.1,  # Low temp = factual, consistent answers
        max_tokens=800
    )
    return response.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The complete RAG call is a clean two-step: hybrid_search() then generate_answer(). No framework magic, no hidden abstractions — just Elasticsearch and the OpenAI API working together.&lt;/p&gt;

&lt;h2&gt;7. Performance Benchmarks&lt;/h2&gt;

&lt;p&gt;I benchmarked four retrieval configurations over a 50,000-chunk legal document corpus on a 3-node Elasticsearch cluster (8 vCPU, 32 GB RAM per node):&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Retrieval Mode&lt;/th&gt;&lt;th&gt;Latency (p95)&lt;/th&gt;&lt;th&gt;Precision@5&lt;/th&gt;&lt;th&gt;Recall@10&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Pure BM25 (keyword)&lt;/td&gt;&lt;td&gt;&amp;lt; 10ms&lt;/td&gt;&lt;td&gt;72%&lt;/td&gt;&lt;td&gt;68%&lt;/td&gt;&lt;td&gt;Baseline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pure kNN (vector only)&lt;/td&gt;&lt;td&gt;45–80ms&lt;/td&gt;&lt;td&gt;81%&lt;/td&gt;&lt;td&gt;78%&lt;/td&gt;&lt;td&gt;Semantic wins&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hybrid (RRF fusion)&lt;/td&gt;&lt;td&gt;50–90ms&lt;/td&gt;&lt;td&gt;89%&lt;/td&gt;&lt;td&gt;87%&lt;/td&gt;&lt;td&gt;Best overall&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hybrid + Reranker&lt;/td&gt;&lt;td&gt;120–160ms&lt;/td&gt;&lt;td&gt;93%&lt;/td&gt;&lt;td&gt;91%&lt;/td&gt;&lt;td&gt;Production choice&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Key finding: Hybrid retrieval with RRF improved Precision@5 by 17 percentage points over pure BM25 and 8 percentage points over pure vector search. Adding a cross-encoder reranker (Elastic's built-in reranker or JinaAI) pushes precision to 93% — close to human-level relevance judgments on this dataset.&lt;/p&gt;

&lt;h2&gt;8. Lessons Learned: The Painful Truths&lt;/h2&gt;

&lt;p&gt;These are the things I wish someone had told me before I started. Every one of these cost me time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 1: Chunking Strategy Matters More Than Your Embedding Model&lt;/strong&gt;&lt;br&gt;
I spent a week swapping between embedding models (ada-002 vs. text-embedding-3-small vs. local SBERT) before realizing my chunk boundaries were the real problem. Chunks that broke mid-sentence destroyed semantic coherence. Fix chunking first, then optimize embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 2: num_candidates is Not top_k&lt;/strong&gt;&lt;br&gt;
num_candidates is how many candidates HNSW explores before returning k results. Setting num_candidates = top_k gives poor recall. A ratio of 10:1 to 20:1 (e.g., num_candidates=100 for k=5) is the sweet spot for accuracy vs. latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 3: Do Not Ignore Metadata Filtering&lt;/strong&gt;&lt;br&gt;
Filtered kNN in Elasticsearch applies the filter inside the HNSW graph traversal, not as a post-filter. This means you get accurate top-k results within the filtered subset. For domain-specific RAG (e.g., 'search only contracts from 2023'), this is a game-changer. Always model your metadata as keyword or date fields, not embedded text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 4: Temperature = 0.1 for RAG, Not 0&lt;/strong&gt;&lt;br&gt;
Setting temperature to 0 makes GPT-4o robotic and occasionally overly literal. Temperature 0.1 gives deterministic-enough answers while allowing the model to paraphrase naturally. This significantly improved user satisfaction scores in my A/B test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 5: Index Refresh Interval&lt;/strong&gt;&lt;br&gt;
By default, Elasticsearch refreshes indices every 1 second (making new docs searchable). During bulk ingestion, set refresh_interval to -1 and manually refresh after ingestion. This reduced my total ingestion time for 500k chunks from 47 minutes to 9 minutes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Disable refresh during bulk ingestion
es.indices.put_settings(index='legal-docs', body={'refresh_interval': '-1'})

# ... run your bulk indexing ...

# Re-enable and force refresh
es.indices.put_settings(index='legal-docs', body={'refresh_interval': '1s'})
es.indices.refresh(index='legal-docs')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;9. Production Optimizations&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HNSW tuning:&lt;/strong&gt; Increase ef_construction (default 100) to 200 during indexing for higher recall at the cost of slower ingestion. At query time, set ef_search=50 for a good recall-latency tradeoff.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantization:&lt;/strong&gt; Enable int8 quantization on your dense_vector field to reduce memory footprint by ~75% with &amp;lt;2% recall loss — a no-brainer for large corpora.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Caching:&lt;/strong&gt; Implement a query embedding cache (Redis or in-memory LRU). Repeated or similar queries hit the cache and skip the OpenAI API call entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Async ingestion:&lt;/strong&gt; Use Python's asyncio with the async Elasticsearch client to parallelize embedding API calls and indexing operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring:&lt;/strong&gt; Track kNN recall@k, answer latency, and LLM token usage in Kibana. Set alerts for retrieval latency spikes above 200ms.&lt;/li&gt;
&lt;/ul&gt;
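&lt;p&gt;The caching point can be sketched with a small in-process LRU wrapper. This is an illustration only: make_cached_embedder is a hypothetical helper, and a shared Redis cache would replace it in a multi-process deployment:&lt;/p&gt;

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=10_000):
    """Wrap a single-text embedding call with an in-process LRU cache."""
    @lru_cache(maxsize=maxsize)
    def embed(text):
        # Tuples are hashable, so results can live in the cache
        return tuple(embed_fn(text))
    return embed

# Usage sketch: embed = make_cached_embedder(lambda t: embed_batch([t])[0])
```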

&lt;p&gt;Example mapping for int8 quantization:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"embedding": {
  "type": "dense_vector",
  "dims": 1536,
  "index": true,
  "similarity": "cosine",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 200
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;10. Conclusion&lt;/h2&gt;

&lt;p&gt;Building a RAG pipeline with Elasticsearch taught me that the fundamentals matter far more than the framework. Elasticsearch is not just a capable vector database — it is arguably the best one for production RAG, because it forces you to think about retrieval rigorously: hybrid scoring, metadata filtering, HNSW parameters, and operational observability.&lt;/p&gt;

&lt;p&gt;The stack I landed on — OpenAI embeddings + Elasticsearch hybrid retrieval + GPT-4o generation — achieved 93% precision at acceptable latency, and the system has been running in production without a single index rebuild since launch. The real lesson: invest in your index design and chunking strategy early. Everything else is tunable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complete Stack Summary&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector Store:        Elasticsearch 8.x (kNN + BM25 hybrid, RRF fusion)
Embedding Model:     OpenAI text-embedding-3-small (1,536 dims)
LLM:                 GPT-4o (temperature=0.1)
Chunking:            RecursiveCharacterTextSplitter, 512 tokens, 50 overlap
Quantization:        int8_hnsw (75% memory reduction)
Framework:           No framework — raw Elasticsearch Python client + OpenAI SDK
Monitoring:          Kibana dashboards + custom latency alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;References &amp;amp; Further Reading&lt;br&gt;
Elasticsearch Documentation — Vector Search: &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html" rel="noopener noreferrer"&gt;https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html&lt;/a&gt;&lt;br&gt;
OpenAI Embeddings Guide: &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;https://platform.openai.com/docs/guides/embeddings&lt;/a&gt;&lt;br&gt;
Reciprocal Rank Fusion (Cormack et al., 2009) — Original RRF paper&lt;br&gt;
HNSW: Efficient and Robust Approximate Nearest Neighbor Search (Malkov &amp;amp; Yashunin, 2018)&lt;br&gt;
LangChain RecursiveCharacterTextSplitter: &lt;a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/" rel="noopener noreferrer"&gt;https://python.langchain.com/docs/modules/data_connection/document_transformers/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/saivarnik12/rag-elasticsearch-pipeline" rel="noopener noreferrer"&gt;https://github.com/saivarnik12/rag-elasticsearch-pipeline&lt;/a&gt;&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>rag</category>
      <category>ai</category>
      <category>vectorsearch</category>
    </item>
  </channel>
</rss>
