Blogathon Topic: Building a Vector-Powered AI Chat Assistant with Elasticsearch (RAG in Action)
Hey everyone! I’m Joshua Premkumar, a developer and tech enthusiast. I spend a lot of my time exploring the intersection of AI, machine learning, and intelligent search systems—basically, finding ways to combine modern AI tools with scalable platforms to solve actual, real-world problems.
Lately, I've been going down the rabbit hole of vector search and Retrieval-Augmented Generation (RAG) systems. There's no shortage of shiny new tools for building AI assistants right now, but I decided to revisit a classic: Elasticsearch. While we usually think of it for traditional search or log analysis, Elasticsearch has evolved to combine its rock-solid keyword search with incredibly powerful vector-based semantic retrieval.
In this post, I want to share my recent exploration into building a simple, context-aware AI chat assistant using Elasticsearch as the vector database.
The Shift from Keywords to "Vectorized Thinking"
If you've ever built a search feature, you know that traditional search engines rely heavily on keyword matching. This is fantastic for structured, predictable queries, but it falls apart the moment a user asks a conversational question.
Let’s say a user searches: "How can companies reduce fuel consumption in aircraft engines?"
A rigid keyword-based system is going to look for those exact words. It might completely miss highly relevant documents that use phrases like "improving engine efficiency," "reducing fuel burn," or "optimizing aircraft performance."
Vector search solves this by converting text into numerical embeddings. Instead of matching literal words, these embeddings capture the semantic meaning of the text. This is what I like to call vectorized thinking—teaching the system to understand the user's intent rather than just playing a game of word-matching.
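To make "vectorized thinking" concrete, here is a tiny sketch using made-up 4-dimensional vectors (a real embedding model produces hundreds of dimensions). The numbers are invented purely for illustration; the point is that semantically related phrases end up close together under cosine similarity, while unrelated text ends up far apart.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings of three phrases:
fuel_consumption  = np.array([0.9, 0.8, 0.1, 0.0])    # "reduce fuel consumption"
engine_efficiency = np.array([0.85, 0.75, 0.2, 0.1])  # "improving engine efficiency"
pasta_recipes     = np.array([0.0, 0.1, 0.9, 0.8])    # "best pasta recipes"

print(cosine_similarity(fuel_consumption, engine_efficiency))  # high (near 1.0)
print(cosine_similarity(fuel_consumption, pasta_recipes))      # low
```

A keyword engine sees zero overlap between "fuel consumption" and "engine efficiency"; a vector engine sees two nearly parallel arrows.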
Enter RAG: Giving AI a Memory
Modern LLMs are incredibly smart, but they have a fatal flaw: they only know what they were trained on. If you ask them about your private company data or highly specific recent documents, they either hallucinate or hit a wall.
Retrieval-Augmented Generation (RAG) bridges this gap by combining an information retrieval system with a language generation model. Instead of letting the AI fly blind, a RAG pipeline fetches relevant documents first and feeds them to the LLM as context before it generates an answer.
The workflow is beautifully simple:
User Query → Create Query Embedding → Vector Search in DB → Retrieve Relevant Docs → Send Context to LLM → Generate Answer
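The workflow above can be sketched end to end in a few lines. Note that `embed()`, `vector_search()`, and `generate()` below are hypothetical stand-ins for an embedding model, an Elasticsearch k-NN query, and an LLM API call respectively; the real versions appear in the steps later in this post.

```python
def embed(text: str) -> list[float]:
    # Stand-in: a real system calls an embedding model here.
    return [float(len(word)) for word in text.split()][:3]

def vector_search(query_vector: list[float], k: int = 3) -> list[str]:
    # Stand-in: a real system runs a k-NN query against Elasticsearch here.
    corpus = ["Elasticsearch supports vector search.",
              "RAG systems combine retrieval and generation."]
    return corpus[:k]

def generate(prompt: str) -> str:
    # Stand-in: a real system calls an LLM API here.
    return f"Answer based on: {prompt[:40]}..."

def rag_answer(question: str) -> str:
    query_vector = embed(question)           # 1. create query embedding
    docs = vector_search(query_vector, k=3)  # 2. retrieve relevant docs
    context = "\n".join(docs)                # 3. assemble context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)                  # 4. generate the final answer

print(rag_answer("How does vector search work?"))
```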
This architecture dramatically improves both the accuracy and the trustworthiness of the AI's responses.
Why Use Elasticsearch for This?
You might be wondering, "Why Elasticsearch when there are dedicated vector databases?"
Elasticsearch is famous for log analysis and observability, but its newer versions support dense vector fields natively. This brings some massive advantages to the table:
Hybrid Search:
This is the killer feature. Elastic lets you combine traditional BM25 keyword search with vector similarity search. Sometimes you need the semantic meaning, but sometimes the user is searching for an exact product SKU or ID. Hybrid search gives you the best of both worlds.
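As a sketch of what hybrid search looks like in practice, recent Elasticsearch versions let you put a knn section alongside a regular query in the same request, and the scores are combined (the exact vector and field values below are placeholders):

```
GET vector_documents/_search
{
  "query": {
    "match": { "content": "vector search" }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [0.12, -0.44, ...],
    "k": 3,
    "num_candidates": 10
  }
}
```

The match clause handles exact terms like SKUs or IDs, while the knn clause catches semantically related documents that share no keywords with the query.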
Battle-Tested Scalability:
It’s built for large-scale data systems. If you need to handle millions of indexed documents, Elasticsearch won't break a sweat.
Ecosystem Compatibility:
It plays incredibly well with modern AI stacks, integrating smoothly with Python, LangChain, LlamaIndex, HuggingFace, and OpenAI APIs.
Building the Assistant: Step-by-Step
Let's look at the architecture I used for my prototype. Think of Elasticsearch as the "memory layer" for the AI assistant. Here is how you can put it together.
Step 1: Generating Document Embeddings
Before we can search semantically, we need to convert our documents into vector embeddings. For this prototype, I used the excellent SentenceTransformers library in Python.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
"Elasticsearch supports vector search.",
"RAG systems combine retrieval and generation.",
"Vector embeddings capture semantic meaning."
]
# Convert text into numerical vectors
embeddings = model.encode(documents)
Step 2: Prepping the Elasticsearch Index
Next, we need to tell Elasticsearch to expect vector data. We do this by creating an index with a dense_vector mapping.
PUT vector_documents
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"embedding": {
"type": "dense_vector",
"dims": 384,
"index": true,
"similarity": "cosine"
}
}
}
}
Note: The dims value (384) matches the output dimension of our chosen SentenceTransformer model, all-MiniLM-L6-v2. The index and similarity settings enable k-NN search on the field; cosine similarity is a sensible default for text embeddings.
Step 3: Indexing the Documents
Now, we push our documents and their corresponding embeddings into Elasticsearch.
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
doc = {
"content": "Vector search improves semantic understanding",
"embedding": embeddings[0].tolist() # Convert numpy array to list
}
es.index(index="vector_documents", document=doc)
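Indexing one document at a time is fine for a demo, but for anything larger you will want the bulk API. Here is a minimal sketch: `build_bulk_actions()` is a helper I wrote for illustration (not part of any library) that yields actions in the shape expected by `elasticsearch.helpers.bulk`.

```python
def build_bulk_actions(index, documents, embeddings):
    """Pair each document with its embedding as a bulk-indexing action."""
    for text, vector in zip(documents, embeddings):
        yield {
            "_index": index,
            "_source": {"content": text, "embedding": list(vector)},
        }

# Usage (assumes `es`, `documents`, and `embeddings` from the earlier steps):
# from elasticsearch.helpers import bulk
# bulk(es, build_bulk_actions("vector_documents", documents, embeddings))
```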
Step 4: Performing the Vector Search
When a user asks a question, we run it through the exact same embedding model and ask Elasticsearch to find the "nearest neighbors" (the most mathematically similar vectors).
# 1. Embed the user's question
query_embedding = model.encode("How does vector search work?")
# 2. Build the k-NN search query
search_query = {
"knn": {
"field": "embedding",
"query_vector": query_embedding.tolist(),  # convert numpy array to list
"k": 3,
"num_candidates": 10
}
}
# 3. Execute the search
response = es.search(index="vector_documents", body=search_query)
Step 5: Generating the Response
Finally, we take the documents returned by Elasticsearch and hand them over to our LLM.
We structure the prompt something like this:
Context: [Insert the documents retrieved from Elasticsearch here]
Question: [Insert the user's original question here]
Answer:

Because the LLM is reading from the provided context, the resulting answer is highly reliable. For example, if a user asks, "What role does vector search play in AI assistants?", the system successfully generates a response like:
"Vector search enables AI assistants to retrieve semantically relevant documents instead of relying only on keyword matching. This allows the assistant to understand user intent and provide more contextually accurate responses."
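To wire the retrieval step into the prompt template above, you need to pull the document text out of the Elasticsearch response. The `build_prompt()` helper below is my own illustration, not a library function; it assumes the standard hits structure returned by `es.search()`.

```python
def build_prompt(response: dict, question: str) -> str:
    """Assemble the Context/Question/Answer prompt from search hits."""
    docs = [hit["_source"]["content"] for hit in response["hits"]["hits"]]
    context = "\n".join(f"- {doc}" for doc in docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Example with a mocked Elasticsearch response:
fake_response = {"hits": {"hits": [
    {"_source": {"content": "Vector search improves semantic understanding"}}
]}}
print(build_prompt(fake_response, "What role does vector search play?"))
```

The resulting string is what you would pass to your LLM client of choice (OpenAI, a local model, etc.).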
Where Do We Go From Here?
Once you have this pipeline set up, the potential applications are endless. You can use this exact architecture to build enterprise knowledge assistants, automate customer support, build semantic e-commerce search engines, or create research document retrieval systems.
The evolution of search is undeniably moving from keyword matching toward deep semantic understanding. By combining the mature, scalable infrastructure of Elasticsearch with the reasoning capabilities of large language models, we can build intelligent assistants that actually understand what our users are asking.
If you are building AI systems today, adopting "vectorized thinking" isn't just a nice-to-have anymore—it's essential for the next generation of knowledge systems.
Thanks for reading, and happy coding! Let me know in the comments if you've tried using Elasticsearch for your RAG pipelines!
"This blog post was submitted to the Elastic Blogathon Contest and is eligible to win a prize."