This blog post was submitted to the Elastic Blogathon Contest and is eligible to win a prize.
By a developer who spent two weekends yelling at embeddings
This is the story of how I tried to build a semantic search feature for a small internal knowledge base, failed in increasingly embarrassing ways, and eventually landed on a working RAG pipeline using Elasticsearch vector search and the Elastic Python client. No fluff. Just what actually happened, including the parts where I had no idea what I was doing.
How It Started (Badly)
It started, like most of my side projects, with a complaint. Our team had a shared Notion with maybe 200 pages of documentation, meeting notes, and random how-tos. Every time someone joined, they'd spend their first three days asking questions that were already answered somewhere in that mess. Classic problem. Obvious solution, right? Build a search thing. How hard could it be.
Famous last words.
I started with a dead simple keyword search. Exported everything to markdown, built a tiny Flask API, used Python's difflib to match queries. It worked exactly as well as you'd expect — which is to say, if someone typed the exact phrase from a document, great. If they asked "how do we handle customer refunds" and the doc said "returns policy", nothing. Zero. The search returned nothing and stared at me blankly.
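That first version is lost to git history, but it was roughly this (names are illustrative, not the exact script):

```python
import difflib

# Roughly what my first Flask endpoint did: fuzzy string similarity
# over whole documents. No semantics, just character-level matching.
def keyword_search(query, docs, top_k=5):
    scored = [
        (difflib.SequenceMatcher(None, query.lower(), d["text"].lower()).ratio(), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]

docs = [
    {"text": "Returns policy: items may be returned within 30 days."},
    {"text": "Slack onboarding checklist for new hires."},
]
# Exact-ish phrasing works; "how do we handle customer refunds" would not.
results = keyword_search("returns policy", docs, top_k=1)
```

Character-level ratios reward surface overlap, which is exactly why paraphrased questions returned nothing.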
So I went down the rabbit hole of semantic search. I'd heard about embeddings. I kind of understood what they were — dense vector representations of meaning, blah blah. I watched some YouTube videos. I felt very smart. Then I tried to actually build it.
Attempt One: The Naive Cosine Similarity Disaster
My first real attempt used sentence-transformers to embed all the documents locally, saved the vectors to a JSON file (yes, really), and then on each query, re-loaded everything and brute-forced cosine similarity across all 200 docs.
The code looked like this:
```python
from sentence_transformers import SentenceTransformer
import json
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

with open('docs.json') as f:
    docs = json.load(f)

# Normalize so the dot product below is actually cosine similarity
doc_embeddings = model.encode([d['text'] for d in docs], normalize_embeddings=True)

def search(query):
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = np.dot(doc_embeddings, q_emb)
    return [docs[i] for i in np.argsort(scores)[::-1][:5]]
```
It worked! Kind of. The semantic matching was real — asking about "refunds" actually surfaced the "returns policy" doc. I was briefly very proud of myself.
Then I ran it in production for one week. Loading that JSON and re-encoding on every single query meant a 4-8 second response time. Users tried it twice and went back to Ctrl+F. The project quietly died.
Attempt Two: FAISS and the Indexing Nightmare
Okay, I told myself. I just need a proper vector index. I'll use FAISS. Facebook built it, it's fast, everyone uses it. Simple.
The setup was fine. Building the index worked. Querying was fast. But then came the operational reality: every time someone added a new doc to Notion, I had to manually re-export, re-embed, and rebuild the whole index. There was no update mechanism. There was no metadata filtering — if someone asked "show me only docs tagged onboarding", I'd have to query FAISS and then post-filter, which got messy fast. And when I tried to add BM25 text matching alongside it for a hybrid approach, I was basically reinventing a search engine from scratch.
A colleague looked at my weekend project and said, very diplomatically: "Have you considered just using Elasticsearch for this?"
I had not. I associated Elasticsearch with Big Enterprise Log Aggregation and thought it was overkill. I was wrong.
Attempt Three: Actually Using the Right Tool
I signed up for Elastic Cloud (they have a 14-day free trial, which is how this whole experiment stayed free). Spun up a cluster in about three minutes. Then I sat down and actually read the docs for once in my life.
The key thing I didn't realize: Elasticsearch has native dense_vector field support and kNN search built in. You don't need a separate vector database. You don't need FAISS on the side. You just define your index mapping to include a dense_vector field, store your embeddings there alongside your text and metadata, and query with kNN syntax. It handles approximate nearest neighbor search at scale, and it plays nicely with BM25 text search in the same query — which is called hybrid search, and it's the thing that makes RAG actually work well in practice.
Setting Up the Index
First I defined my index mapping with both text and vector fields:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=CLOUD_ID, api_key=API_KEY)

mappings = {
    "properties": {
        "title": {"type": "text"},
        "content": {"type": "text"},
        "tags": {"type": "keyword"},
        "embedding": {
            "type": "dense_vector",
            "dims": 384,  # matches all-MiniLM-L6-v2 output
            "index": True,
            "similarity": "cosine"
        }
    }
}

es.indices.create(index="knowledge-base", mappings=mappings)
```
Indexing the Documents
Then I wrote the indexing script. It ran once for the initial load; after that I just re-run it for any new or updated doc:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def index_doc(doc_id, title, content, tags):
    embedding = model.encode(content).tolist()
    es.index(index='knowledge-base', id=doc_id, document={
        "title": title,
        "content": content,
        "tags": tags,
        "embedding": embedding
    })
```
The Hybrid Search Query
This is the part that actually surprised me — how clean the hybrid query is in Elastic's syntax. You can combine kNN vector search with BM25 text match in one request, and Elastic handles the score fusion:
```python
def hybrid_search(query, tag_filter=None, top_k=5):
    q_emb = model.encode(query).tolist()
    tag_clause = [{"term": {"tags": tag_filter}}] if tag_filter else []
    results = es.search(
        index='knowledge-base',
        knn={
            "field": "embedding",
            "query_vector": q_emb,
            "k": top_k,
            "num_candidates": 50,
            "filter": tag_clause  # filter the vector side too, not just BM25
        },
        query={
            "bool": {
                "must": [{"match": {"content": query}}],
                "filter": tag_clause
            }
        },
        source=["title", "content", "tags"]
    )
    return results['hits']['hits']
```
That filter parameter was a game changer. Now someone could say "show me onboarding docs about Slack" and it would semantically search only within the onboarding tag. The thing I'd been wrestling with for two weekends just... worked.
Closing the Loop: The RAG Part
Search by itself was useful. But what people actually wanted was answers, not links. So I added a thin RAG layer on top: take the top 3 search results, stuff them into a prompt context, and let an LLM synthesize an answer. The whole pipeline looks like this:
```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question, tag_filter=None):
    # 1. Retrieve relevant chunks
    hits = hybrid_search(question, tag_filter=tag_filter, top_k=3)
    context = '\n\n'.join(h['_source']['content'] for h in hits)

    # 2. Build prompt
    prompt = f'''
You are a helpful assistant. Answer using only the context below.

Context: {context}

Question: {question}

Answer:'''

    # 3. Call LLM (any OpenAI-compatible endpoint works here)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response.choices[0].message.content
```
The whole thing — from question to answer — takes under two seconds on Elastic Cloud. Latency for the Elasticsearch query itself is typically 30-80ms. The rest is LLM inference time.
What Actually Surprised Me
A few things I didn't expect:
• Elastic's hybrid search is genuinely better than either approach alone. Pure vector search struggles with proper nouns, specific version numbers, and jargon that doesn't embed well. Pure BM25 fails on conceptual queries. Together, they cover each other's blind spots.
• The update story is just... normal. Call es.index() with the doc id, it upserts. No index rebuilding, no downtime. This sounds obvious but after my FAISS experience it felt like magic.
• Kibana is legitimately useful for debugging. I could poke at my index, run test queries in the Dev Console, and see exactly why certain docs were or weren't surfacing. Debugging a vector search system without that visibility is brutal.
• Chunking strategy matters more than the model. I initially embedded whole pages (sometimes 2,000+ words). Search got noticeably better once I chunked into ~300-word segments with 50-word overlaps. Swapping embedding models moved the needle far less than fixing the chunking did.
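The chunker I ended up with is nothing fancy — roughly this, word-based with overlap (the 300/50 numbers are the defaults I settled on for our docs, not gospel):

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```

Each chunk then goes through `index_doc` as its own Elasticsearch document, with the parent page title carried along in metadata.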
Honest Caveats
This setup works great for our 200-document knowledge base. I haven't stress-tested it at 200,000 documents. The kNN performance characteristics change as you scale, and you'd want to think harder about sharding strategy and num_candidates tuning. We also didn't go deep on security — the whole thing sits behind our internal VPN.
Also: the LLM still hallucinates occasionally when the retrieved context is ambiguous or contradictory. RAG doesn't solve hallucination, it just reduces it by grounding responses in your actual data. You still need to think about what happens when the retrieval step comes up empty.
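The empty-retrieval case deserves an explicit guard rather than hoping the LLM declines gracefully. Mine is a simple score threshold before the prompt gets built — the 0.5 cutoff below is a value I eyeballed for my own index, since combined BM25+kNN scores aren't on a fixed scale:

```python
def retrieval_guard(hits, min_score=0.5):
    """Return joined context, or None when retrieval found nothing usable.

    min_score is calibrated by hand against real queries -- hybrid scores
    are not comparable across indices or even across query types.
    """
    good = [h for h in hits if h.get("_score", 0) >= min_score]
    if not good:
        return None
    return "\n\n".join(h["_source"]["content"] for h in good)
```

When this returns `None`, the bot says "I couldn't find anything on that" instead of hallucinating an answer from an empty context.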
Conclusion + Takeaways
I spent four weekends on a problem that, with Elastic, took one focused Saturday to actually solve properly. Part of that is a skill issue — I should have reached for the right tool earlier. But part of it is that vector search infrastructure genuinely used to be more scattered. You needed separate systems, separate syncing, separate operational burden. The fact that all of this — vector indexing, kNN search, BM25, filtering, hybrid ranking — lives in one system with one query API is a real quality-of-life improvement.
If you're building a RAG pipeline and you're currently duct-taping together FAISS + a database + some custom sync script, it's worth trying Elastic for a week and seeing if it simplifies your stack. It simplified mine considerably.
The things I'd tell past-me:
• Chunk your docs into smaller pieces from the start. 300 words with overlap is a solid default.
• Use hybrid search, not just kNN. The BM25 component is doing real work.
• Build the retrieval evaluation before you build the generation layer. Know if your search is finding the right chunks before you bolt an LLM onto it.
• Elastic Cloud's free trial is genuinely enough to build and test the whole pipeline.
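On the evaluation point: even a ten-line recall check over a handful of hand-labeled queries is enough to catch a broken retriever before the LLM hides it. Something like this (the labeled pairs are hypothetical; `search_fn` would be the `hybrid_search` from earlier):

```python
def recall_at_k(search_fn, labeled_queries, k=5):
    """Fraction of labeled queries whose expected doc id lands in the top k."""
    found = 0
    for query, expected_id in labeled_queries:
        top = search_fn(query)[:k]
        if expected_id in [hit["_id"] for hit in top]:
            found += 1
    return found / len(labeled_queries)

# Hypothetical hand-labeled pairs: (user question, id of the doc that answers it)
labeled = [
    ("how do we handle customer refunds", "returns-policy"),
    ("slack setup for new hires", "onboarding-slack"),
]
```

Twenty labeled pairs and a number you re-check after every chunking or query change beats vibes-based debugging every time.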
Full code is up on GitHub (link in bio). Happy to answer questions if anything here doesn't make sense — I've probably already made the mistake you're about to make.
Tags: vector search, Elasticsearch, RAG, Elastic Cloud, Python, semantic search, retrieval augmented generation, kNN, hybrid search