Why Vector-First Thinking Isn't Enough and How Hybrid Retrieval Unlocks Production-Grade Agent Memory
Disclaimer: This blog was submitted as part of Elastic Blogathon.
Author Introduction
I’m a Software Engineer at Nokia focused on building resilient, production-grade agentic systems. I've learned that the difference between a demo and a deployable solution often comes down to memory architecture. In this blog, I'll share technical insights from hands-on experience implementing Elasticsearch as the memory layer for autonomous AI agents in enterprise environments, where hybrid retrieval, scalability, and precision aren't just nice-to-haves—they're non-negotiable.
Abstract
AI agents need memory that thinks, not just stores. This blog explores why Elasticsearch—with its native vector search (HNSW + BBQ quantization), hybrid retrieval (dense + sparse + BM25), and production-hardened infrastructure—provides the ideal memory layer for autonomous agents. We'll build a working RAG pipeline with code, examine real-world use cases, analyze performance optimizations, and demonstrate why the future of agentic AI depends on intelligent, context-aware retrieval systems that combine semantic understanding with precision filtering.
The Agent Memory Problem: Why Pure Vector Search Falls Short
Every AI agent faces the same fundamental challenge: how to remember the right information at the right time. LLMs can reason brilliantly within their context window, but that window is finite. Even with million-token contexts, indiscriminately stuffing every piece of historical data leads to degraded reasoning, context poisoning, and hallucinations.
The industry's default answer has been to use a vector database. Embed your data, store the vectors, retrieve via nearest neighbour search. This works for simple semantic search, but it breaks down fast when agents need:
• Exact-match filtering on structured fields (user IDs, timestamps, entity types)
• Keyword relevance for precise terminology or error codes
• Multi-tenant isolation at scale (millions of users, billions of memories)
• Temporal freshness (recent events weighted higher than old ones)
• Hybrid signals combining semantic similarity with business logic

Pure vector databases excel at one thing: finding semantically similar content. But real agent memory is messier. When an agent asks "What did user #4471 say about their budget concerns yesterday?", you need exact matching on the user ID, semantic matching on budget-related intent, recency filtering for yesterday's interactions, and intelligent ranking to surface the most relevant snippet. Vector search alone can't deliver this.
This is where Elasticsearch distinguishes itself. It's not trying to be a vector database—it's an AI-native search engine with production-grade vector capabilities layered into a mature, battle-tested platform designed for exactly this kind of complex, multi-signal retrieval.
Technical Deep Dive: Elasticsearch's Vector Architecture
Dense Vector Search with HNSW
At the foundation of Elasticsearch's vector capabilities is the dense_vector field type, which stores high-dimensional embeddings and indexes them using HNSW (Hierarchical Navigable Small World)—the same approximate nearest neighbour algorithm powering most production vector systems.
Elasticsearch supports up to 4,096 dimensions with similarity metrics including cosine, L2 norm, dot product, and max inner product. The HNSW implementation builds a multi-layer graph structure that enables logarithmic-time approximate search, making it practical to query across millions or billions of vectors with sub-second latency.
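As a concrete anchor for these similarity metrics, here is a minimal, dependency-free sketch of how cosine similarity and dot product differ (illustrative only — Elasticsearch computes these natively inside the HNSW index, and the vectors here are invented):

```python
import math

def cosine_similarity(a, b):
    # Cosine compares direction only, ignoring vector magnitude —
    # the behaviour of Elasticsearch's "cosine" similarity setting.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    # Dot product rewards magnitude as well as direction; on unit-length
    # vectors it is equivalent to cosine but cheaper to compute.
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 0.0, 1.0]
b = [2.0, 0.0, 2.0]
print(cosine_similarity(a, b))  # 1.0 — identical direction
print(dot_product(a, b))        # 4.0 — magnitude contributes
```

This is why embedding models that emit normalized vectors pair well with dot product: you get cosine-equivalent ranking at lower cost.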
BBQ Quantization: 95% Memory Reduction Without Sacrificing Recall
Storing 1,536-dimensional float32 vectors at scale is expensive. A million embeddings consume approximately 6GB of memory. Elasticsearch's Better Binary Quantization (BBQ)—now the default for vectors with 384+ dimensions—compresses vectors down to single-bit representations, achieving roughly 95% memory reduction.
The key innovation: BBQ doesn't just naively binarize vectors. It normalizes them around learned centroids, stores error-correction values, and uses oversampling with rescoring against the original float vectors. In benchmarks across 10 BEIR datasets, BBQ matched or exceeded raw float32 performance on NDCG@10 in 9 out of 10 cases while using 32x less memory. For agent systems storing millions of interactions, this is the difference between a manageable infrastructure budget and an unsustainable one.
Hybrid Retrieval: The Game-Changer for Production Agents
Here's where Elasticsearch truly separates itself from purpose-built vector stores. Real agent queries rarely need just semantic similarity. They need a blend of:
• Dense vectors for semantic understanding
• BM25 lexical search for exact term matching
• Sparse vectors (ELSER) for learned term expansion without fine-tuning
• Structured filters on metadata fields

Elasticsearch's Reciprocal Rank Fusion (RRF) elegantly merges these signals without requiring score normalization or calibration. BM25 scores and cosine similarities live on completely different scales—trying to blend them with weighted averaging is a tuning nightmare. RRF sidesteps this by looking only at rank positions. A document ranking #2 in BM25 and #3 in kNN gets a higher fused score than one ranking #1 in only kNN. Documents that appear across multiple retrieval methods naturally float to the top.
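To make the mechanics concrete, here is a minimal, illustrative RRF implementation. The `k=60` constant mirrors Elasticsearch's default `rank_constant`; the document IDs are invented:

```python
def rrf_fuse(ranked_lists, k=60):
    # Reciprocal Rank Fusion: each appearance contributes 1 / (k + rank),
    # so only rank positions matter — raw scores are never compared.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_a", "doc_b", "doc_c"]  # lexical ranking
knn  = ["doc_b", "doc_d", "doc_a"]  # dense vector ranking
print(rrf_fuse([bm25, knn]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note how `doc_b` (ranked #2 and #1) beats `doc_d` (ranked #2 in only one list): documents that appear across multiple retrieval methods float to the top, exactly as described above.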
Building a Production Agent Memory System
Step 1: Index Mapping - Three Search Surfaces, One Document
The first step is defining an index that supports multiple retrieval strategies on the same data. Here's a production-grade mapping:
PUT /agent_memory
{
  "mappings": {
    "properties": {
      "memory_id":   { "type": "keyword" },
      "user_id":     { "type": "keyword" },
      "content":     { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "bbq_hnsw" }
      },
      "created_at":  { "type": "date" },
      "memory_type": { "type": "keyword" }
    }
  }
}
This mapping creates three search surfaces: (1) embedding for dense vector similarity, (2) content for BM25 lexical search, and (3) structured fields like user_id and memory_type for precise filtering.
Step 2: Ingesting Agent Memories
When an agent completes a session or reaches a checkpoint, we extract key memories, embed them, and persist to Elasticsearch:
from datetime import datetime
from elasticsearch import Elasticsearch
from openai import OpenAI
import uuid

es = Elasticsearch("https://your-cluster.elastic-cloud.com")
client = OpenAI()

def store_memory(user_id: str, content: str, memory_type: str):
    # Generate embedding
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=content
    ).data[0].embedding

    # Index to Elasticsearch
    es.index(
        index="agent_memory",
        id=str(uuid.uuid4()),
        document={
            "user_id": user_id,
            "content": content,
            "embedding": embedding,
            "memory_type": memory_type,
            "created_at": datetime.utcnow().isoformat()
        }
    )
Step 3: Hybrid Retrieval with RRF
When the agent needs to recall relevant memories, we fire a hybrid query that combines dense vectors, BM25, and filtering:
def recall_memories(user_id: str, query: str, top_k: int = 5):
    # Embed the query
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Hybrid search with RRF
    response = es.search(
        index="agent_memory",
        size=top_k,
        retriever={
            "rrf": {
                "retrievers": [
                    {  # Dense vector kNN
                        "knn": {
                            "field": "embedding",
                            "query_vector": query_embedding,
                            "k": 20,
                            "num_candidates": 100,
                            "filter": {"term": {"user_id": user_id}}
                        }
                    },
                    {  # BM25 lexical search
                        "standard": {
                            "query": {
                                "bool": {
                                    "must": [{"match": {"content": query}}],
                                    "filter": [{"term": {"user_id": user_id}}]
                                }
                            }
                        }
                    }
                ],
                "rank_constant": 60
            }
        }
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
This hybrid query ensures we get the best of all worlds: semantic similarity via dense vectors, keyword relevance via BM25, and precise filtering on user_id. RRF intelligently combines these signals without manual score tuning.
Step 4: Integrating with LLM for RAG
Once we've retrieved relevant memories, we inject them into the LLM's context:
def agent_response(user_id: str, user_query: str):
    # Retrieve relevant memories
    memories = recall_memories(user_id, user_query, top_k=3)

    # Build context from memories
    context = "\n".join([m["content"] for m in memories])

    # Generate response with RAG
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"""You are an AI assistant.
Use the following context to answer the user's question:
{context}"""},
            {"role": "user", "content": user_query}
        ]
    )
    return response.choices[0].message.content
Real-World Use Cases: Where Elasticsearch Powers Production Agents
Use Case 1: Enterprise Customer Support Agent
An autonomous customer support agent handling 10,000+ daily interactions needs to remember:
• Past conversations with each customer (episodic memory)
• Product documentation and troubleshooting guides (knowledge base)
• Successful resolution patterns for common issues
• User preferences and communication style
Elasticsearch's hybrid retrieval allows the agent to:
• Find semantically similar past issues (vector search)
• Filter by exact customer ID and date ranges (structured filters)
• Match precise error codes or product names (BM25)
• Prioritize recent interactions over old ones (recency boosting)
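As an illustrative sketch of the recency boosting mentioned above, a `function_score` query with a Gaussian decay on the `created_at` field from the earlier mapping does the job. The query text and decay parameters here are assumptions, not production-tuned values:

```python
# Sketch: multiply relevance by a Gaussian date decay so that, all else
# equal, recent memories outrank old ones. Field names follow the
# agent_memory mapping shown earlier.
recency_boosted_query = {
    "function_score": {
        "query": {"match": {"content": "billing error"}},
        "functions": [
            {
                "gauss": {
                    "created_at": {
                        "origin": "now",   # full score at the present moment
                        "scale": "7d",     # score halves roughly a week out...
                        "decay": 0.5       # ...per this decay factor
                    }
                }
            }
        ],
        "boost_mode": "multiply"  # combine decay with the relevance score
    }
}
# resp = es.search(index="agent_memory", query=recency_boosted_query)
```

Tuning `scale` is the main lever: shorter scales make the agent aggressively prefer fresh context; longer scales let older but highly relevant memories survive.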
Use Case 2: DevOps Observability Agent
An agent monitoring infrastructure ingests millions of log events, metrics, and traces. When an anomaly is detected, it needs to:
• Find similar past incidents (vector similarity on error signatures)
• Match exact service names and error codes (keyword search)
• Filter by severity level and timeframe (structured queries)
• Retrieve successful remediation runbooks (hybrid retrieval)
Elasticsearch's observability DNA makes it a natural fit here. The same platform that indexes the logs also powers the agent's memory, eliminating data duplication and latency.
Performance Optimizations for Production Scale
1. BBQ Quantization for Cost Efficiency
As mentioned earlier, BBQ reduces memory consumption by 95% while maintaining recall. For an agent system storing 10 million memories with 1536-dimensional embeddings:
• Without BBQ: ~61 GB memory required
• With BBQ: ~3 GB memory required
This enables horizontal scaling without infrastructure costs spiralling out of control.
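The arithmetic behind those figures is easy to verify. The BBQ number below is the raw one-bit payload; the ~3 GB figure quoted above also accounts for the correction values BBQ stores alongside it:

```python
# Back-of-envelope memory math for 10M memories at 1,536 dimensions.
n_vectors = 10_000_000
dims = 1536

float32_gb = n_vectors * dims * 4 / 1e9  # 4 bytes per float32 dimension
bbq_gb = n_vectors * dims / 8 / 1e9      # ~1 bit per dimension after BBQ

print(f"float32: {float32_gb:.1f} GB")   # 61.4 GB
print(f"BBQ:     {bbq_gb:.1f} GB")       # 1.9 GB raw bits (+ correction data)
```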
2. Filtered kNN with ACORN
Most vector databases suffer from post-filtering performance cliffs. When you apply a filter (e.g., user_id = '12345'), they first retrieve N neighbours, then filter, often resulting in too few results or requiring multiple passes.
Elasticsearch's ACORN-based filtered HNSW search integrates filters directly into the graph traversal. This means filtering happens during search, not after, maintaining consistent performance even with highly selective filters.
3. Chunking Strategy
Optimal chunk size for agent memory is context-dependent, but empirically:
• 300-500 tokens per chunk balances context richness with retrieval precision
• 50-100 token overlap prevents semantic boundary loss
• Summarize long conversations before indexing to reduce noise
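A minimal chunking sketch along these lines, using naive whitespace "tokens" as a stand-in for a real tokenizer (swap in tiktoken or similar for accurate counts):

```python
def chunk_text(text, chunk_size=400, overlap=75):
    # Split text into overlapping chunks so that semantic boundaries
    # are not lost at chunk edges. Whitespace tokens approximate real
    # tokenizer counts for illustration only.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks of up to 400 tokens, overlapping by 75
```

The 75-token overlap means each chunk repeats the tail of its predecessor, so a sentence straddling a boundary is fully present in at least one chunk.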
Production Lessons: What I Learned Deploying This in the Wild
In my experience deploying Elasticsearch-backed agent memory systems handling millions of daily interactions, several non-obvious insights emerged:
1. Context Quality Matters More Than Context Quantity
Early iterations retrieved too many memories (top 20, top 50). We found top 3 to top 5 highly relevant memories consistently outperformed flooding the context window with marginally relevant data. The hybrid retrieval's precision made this possible.
2. Memory Types Need Different TTLs
Not all memories age equally. We implemented tiered TTLs:
• Short-term task context: 7 days
• User preferences: 90 days (refreshed on each access)
• Domain knowledge: Permanent
Elasticsearch's date-based filtering made this trivial to implement.
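A sketch of one way to implement those tiers, using the `memory_type` and `created_at` fields from the earlier mapping. The tier names and cutoffs here are illustrative; the resulting query can be handed to `delete_by_query` on a schedule:

```python
# Tiered TTLs keyed by memory type; a type with no entry (e.g. domain
# knowledge) is kept permanently. Cutoffs use Elasticsearch date math.
TTL_BY_TYPE = {
    "task_context": "now-7d",
    "user_preference": "now-90d",
}

def build_expiry_query(memory_type, cutoff):
    # Matches expired memories of one tier, for use with delete_by_query.
    return {
        "bool": {
            "filter": [
                {"term": {"memory_type": memory_type}},
                {"range": {"created_at": {"lt": cutoff}}},
            ]
        }
    }

# for mem_type, cutoff in TTL_BY_TYPE.items():
#     es.delete_by_query(index="agent_memory",
#                        query=build_expiry_query(mem_type, cutoff))
```

Refreshing user preferences on access then reduces to re-indexing the document with an updated `created_at`, which resets its position relative to the cutoff.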
3. Observability is Critical
We instrumented every retrieval query to log:
• Which retrieval signals contributed to top results
• Score distributions across vector, BM25, and ELSER
• Filter effectiveness (how many candidates passed filters)
This telemetry revealed bottlenecks and guided optimizations that wouldn't have been obvious from end-to-end latency metrics alone.
Beyond RAG: Why Agent Memory Needs More Than Retrieval
The RAG pattern has dominated AI memory discussions, but it's fundamentally a reactive paradigm: the agent asks a question, the memory system retrieves relevant context, and the LLM generates a response. This works for Q&A bots, but autonomous agents need proactive memory.
What does proactive memory look like?
• Memory consolidation: Periodically summarizing and re-embedding episodic memories to reduce noise
• Pattern detection: Using aggregations to identify recurring issues or preferences
• Predictive pre-fetching: Anticipating which memories will be needed based on agent state
• Memory-driven planning: Using historical success/failure rates to guide future actions
Elasticsearch's aggregation framework, update-by-query capabilities, and real-time indexing enable all these patterns. The memory layer becomes an active reasoning substrate, not just a passive lookup table.
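As one example of the pattern-detection idea, a terms aggregation with a nested date histogram can surface which memory types recur and when. Field names follow the earlier mapping; the 30-day window is an assumption:

```python
# Sketch: count recurring memory types over the last 30 days, bucketed
# by day — aggregations as an active reasoning signal, not retrieval.
pattern_agg = {
    "size": 0,  # we only want aggregation buckets, not hits
    "query": {"range": {"created_at": {"gte": "now-30d"}}},
    "aggs": {
        "recurring_types": {
            "terms": {"field": "memory_type", "size": 10},
            "aggs": {
                "per_day": {
                    "date_histogram": {
                        "field": "created_at",
                        "calendar_interval": "day"
                    }
                }
            }
        }
    }
}
# resp = es.search(index="agent_memory", body=pattern_agg)
```

An agent can run a query like this on a schedule and promote frequently recurring issues into consolidated, re-embedded summary memories.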
System Architecture Diagram
Below is a high-level architecture showing how Elasticsearch integrates into an agent system:
Conclusion & Key Takeaways
Building production-grade AI agents requires more than clever prompting and powerful LLMs. It demands a memory architecture that can:
• Understand meaning through semantic vectors
• Deliver precision through keyword matching and filters
• Scale to billions of memories without infrastructure collapse
• Integrate seamlessly into existing data ecosystems
Elasticsearch delivers on all fronts. Its HNSW-based vector search rivals purpose-built vector databases in performance. Its BBQ quantization makes massive-scale deployments economically viable. Its hybrid retrieval with RRF solves the precision-recall trade-off that breaks pure vector approaches. And its battle-tested infrastructure means you're building on a foundation designed for production from day one.
Key Takeaways
• Pure vector search is necessary but not sufficient for agent memory. Hybrid retrieval combining dense vectors, sparse vectors, and BM25 lexical search delivers superior results.
• Quantization isn't a compromise—BBQ achieves 95% memory reduction while maintaining or improving recall, making scale economically feasible.
• RRF eliminates score normalization hell—by fusing results based on rank positions, not raw scores, it delivers robust hybrid retrieval without manual tuning.
• Memory quality > memory quantity—top-3 highly relevant memories consistently outperform top 20 marginally relevant ones.
• Elasticsearch is production-ready—with distributed architecture, observability tooling, and mature ecosystem integration, it's built for real workloads, not demos.
As AI agents evolve from reactive chatbots to autonomous systems capable of complex reasoning and multi-step planning, their memory layer becomes the foundation everything else builds on. Elasticsearch—with its unique combination of semantic understanding, lexical precision, and production-grade infrastructure—isn't just a good choice for agent memory. It's the right choice.

