DEV Community: Kacper Łukawski

Context Engineering for Agentic Systems: What Goes Into Your Agent's Mind

Kacper Łukawski — Mon, 20 Apr 2026 10:56:42 +0000

Every new generation of Large Language Models arrives with a bigger context window - and the temptation to use it fully. If the model can read a million tokens, why not feed it everything? In practice, more context doesn't reliably mean better answers: it often means higher costs, slower responses, and a model that loses track of what actually matters. Context engineering is the discipline of deciding not just what to put in the context window, but how much, in what form, and when to leave things out - and it's quickly becoming one of the most important skills in building reliable agentic systems.

Why context is so important for agentic systems

An LLM has exactly two sources of information when generating a response:

Internal state ("knowledge") - what was baked in during training. It is static, potentially stale, and opaque to the developer.
Context ("prompt") - everything provided at inference time. That's the only thing we can actively control.

Training knowledge is fixed. We can't update it without retraining, and we can't know exactly what the model does or doesn't know - though most providers publish a knowledge cutoff date in their model cards or documentation, which tells you the point beyond which the model has no awareness of world events. Context is the lever we actually have. Everything a model knows about the current task, the current user, the tools available to it, and the world right now has to come through the context window.

Today's leading models offer context windows that would have seemed impossibly large just a few years ago - millions of tokens, enough to fit entire codebases, legal contracts, or a stack of research papers in a single prompt. Yet in practice, agentic systems burn through these limits surprisingly fast. A system prompt, a set of tool definitions, all tool calls and results, a few retrieved documents, and a handful of conversation turns can easily consume tens of thousands of tokens before the agent has done anything meaningful. And even when the hard limit isn't reached, performance often degrades long before it is - the model starts losing track of earlier instructions, repeating itself, or missing relevant details buried under layers of accumulated context.

At step 1, the context holds little more than the user's task. By step N, it has grown to include every tool call, every result, every model response, and any retrieved documents - all concatenated and re-sent from scratch on every iteration.

The difference from one-shot prompting is stark. A single prompt is small, hand-crafted, and fully under control. An agentic system operates in a loop - reasoning, calling tools, receiving results, and repeating, potentially dozens of times. Because LLMs are stateless, every iteration re-sends the entire accumulated history from scratch. The context isn't a fixed input, but more of a growing log, and context engineering is about managing that growth.

When less is more

Transformers architecture behind the LLMs work by letting every token attend to every other token in the context. This is what makes them so powerful at integrating information - but it also means the model's capacity is spread across all tokens simultaneously. Think of it as an attention budget: every new token you introduce depletes it by some amount, regardless of whether that token is useful or not.

The practical consequence is that irrelevant or redundant content doesn't just waste space - it actively competes with the information that actually matters. A critical instruction buried under pages of tool outputs may receive less attention than if it had been sent alone. Research from Anthropic confirms this: models remain capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to shorter ones. A million-token context window is not a free pass to include everything - it's a budget, and every token you add is a trade-off.

The cost dimension

Most hosted LLMs charge per input token, which means every byte of context has a price tag. A single call with a 50,000-token context costs roughly 50× more than one with 1,000 tokens - and in an agentic loop that runs dozens of iterations, that multiplier compounds with every step. Context management is therefore not just a quality concern but a cost concern: a bloated context window can turn a cheap pipeline into an expensive one without producing any better answers.

What fills the context window in an agentic system

We've already touched on some of the components that fill an agent's context window - system prompts, tool definitions, retrieved documents. Let's map out the full picture, because the list is longer than many developers expect.

System prompt - standing instructions, persona, constraints, output format. Usually fixed but can be large.
Conversation history - the full back-and-forth between user and agent across the current session.
Memory - retrieved facts from past sessions or external knowledge stores. See also: Using Mem0 Memory Store with Haystack Agents.
Retrieval output - documents or chunks fetched proactively by a RAG pipeline, before the model acts. This data arrives in context as part of the input to the model, not as a consequence of something the model decided to do.
Tool definitions - every tool the model could call must be described in the context (name, description, parameters schema). With MCP toolsets, this can easily balloon into hundreds of tool descriptions.
Tool call results - the output of tools the model itself chose to invoke. Unlike retrieval output, these arrive mid-session as a consequence of the model's actions. They can be surprisingly large: a read file operation returning a 500-line source file, a web search returning multiple scraped pages, or a database query returning hundreds of rows - and each result stays in context for the remainder of the session.
Few-shot examples - demonstration input/output pairs used to guide model behaviour.

The iceberg effect. A user sees a single answer. Behind the scenes, the model may have received 50,000 tokens or more on that one turn - a system prompt (perhaps 10k tokens), tool definitions (5k), retrieved documents (20k), and accumulated conversation history (15k). The answer is the tip, while the context is everything below the surface.

What the context actually looks like

The screenshot above shows Claude Code's /context command, which breaks down exactly where tokens are being spent: system prompt, tool definitions, conversation history, open files, etc. Knowing this makes it possible to identify which component is responsible for a bloated context and whether that cost is justified. With this visibility, optimisation is a bit easier.

Building a Haystack agent

from haystack.components.agents import Agent
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"It's sunny and 22°C in {city}."

agent = Agent(
    chat_generator=AnthropicChatGenerator(),
    system_prompt="You are a helpful assistant.",
    tools=[get_weather],
)

result = agent.run(messages=[ChatMessage.from_user("What's the weather in Paris?")])
print(result["last_message"].text)

When you create an agent in Haystack, much of the context is assembled automatically. Tool descriptions are serialised and injected into the prompt under the hood - you define a tool once, and the framework ensures the model receives everything it needs to call it: the name, description, and parameter schema. The same applies to conversation history, which is maintained across turns without any manual concatenation. The context you see in your code is just the surface, but the model receives considerably more on every call.

Strategies for managing context growth

Context explosion is not inevitable. Once you understand what's filling the window, you can start making choices about what actually needs to be there. There are several proven techniques for keeping context short without sacrificing quality.

Delegation to subagents

Another way to keep context small is to never let it grow large in the first place. Instead of one agent accumulating the full history of a complex task, you can split the work across specialised subagents - each one receiving only the slice of context relevant to its job. The orchestrator maintains a thin, high-level context, while the worker agents get focused, task-specific contexts. The total token count across the system may be similar, but no single model call is burdened with everything at once. For a practical example of this pattern in Haystack, see Building a Swarm of Agents or the Creating a Multi-Agent System with Haystack tutorial.

from haystack.components.agents import Agent
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for up-to-date information on a topic."""
    return f"Search results for '{query}': ..."

# Worker agent: only receives context relevant to its task
researcher = Agent(
    chat_generator=AnthropicChatGenerator(),
    system_prompt="You are a research assistant. Answer questions concisely.",
    tools=[search_web],
)

from haystack.tools import ComponentTool

delegate_research = ComponentTool(
    component=researcher,
    name="delegate_research",
    description="Delegate a research question to a specialised agent.",
    outputs_to_string={"source": "last_message"},
)

# Orchestrator: only sees compact summaries from worker agents
orchestrator = Agent(
    chat_generator=AnthropicChatGenerator(),
    system_prompt="Break down tasks and delegate them to specialised agents.",
    tools=[delegate_research],
)

result = orchestrator.run(messages=[ChatMessage.from_user("Compare quantum and classical computing.")])
print(result["last_message"].text)

Improving retrieval quality

In RAG pipelines, retrieval quality directly determines how many tokens land in the context. Poor retrieval returns irrelevant chunks that add noise without adding value - each one consuming part of the attention budget. Better precision means fewer chunks are needed, which means a smaller, cleaner context.

A related problem is redundancy: when retrieved passages are near-duplicates, the model sees the same information repeated multiple times without gaining anything new. This is why diversity matters as much as relevance - a set of chunks that each cover a different facet of the question is far more efficient than a set of very similar top matches. Techniques like hybrid retrieval, HyDE, query decomposition, and auto-merging retrieval all help surface results that are both more relevant and more varied.

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

rag = Pipeline()
rag.add_component("embedder", SentenceTransformersTextEmbedder())
# Retrieve 10 candidates, then rerank to the 3 most relevant
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=10))
rag.add_component("ranker", TransformersSimilarityRanker(top_k=3))

rag.connect("embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "ranker.documents")

result = rag.run({
    "embedder": {"text": "climate change"},
    "ranker": {"query": "climate change"},
})
# result["ranker"]["documents"] now contains at most 3 highly relevant chunks

Coming up in the series: Retrieval quality deserves a post of its own. The next article will go deep on techniques for surfacing more relevant, more diverse results - so your RAG pipelines put more important tokens in front of the model.

Summarisation and compaction

As a conversation grows, the raw message history becomes the biggest consumer of context. Compaction addresses this by periodically replacing the accumulated history with a condensed summary - retaining the essential facts and decisions while discarding the verbatim back-and-forth. The agent continues with a much shorter context, and the summary is updated with each new turn.

This pattern is well-established in practice. Popular coding agents' context compaction feature works exactly this way: when the context approaches its limit, it summarises the conversation so far and continues from the summary rather than truncating or failing.

from haystack import Pipeline
from haystack.core.component import component
from haystack.components.agents import Agent
from haystack.components.builders import ChatPromptBuilder
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import tool
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.components.writers import ChatMessageWriter

@tool
def get_current_date() -> str:
    """Return today's date."""
    from datetime import date
    return date.today().isoformat()

@component
class HistoryCompactor:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.compactor = ChatPromptBuilder(
            template=[
                ChatMessage.from_user(
                    "Summarise the key facts from the conversation below in "
                    "3-5 bullet points.\n\n"
                    "{{ history }}"
                )
            ],
            required_variables=["history"],
        )
        self.summariser = AnthropicChatGenerator(model="claude-haiku-4-5-20251001")

    @component.output_types(messages=list[ChatMessage])
    def run(self, messages: list[ChatMessage]) -> dict:
        if len(messages) <= self.threshold:
            return {"messages": messages}
        history_text = "\n".join(f"{m.role}: {m.text}" for m in messages if m.text)
        prompt = self.compactor.run(template_variables={"history": history_text})["prompt"]
        summary = self.summariser.run(messages=prompt)["replies"][0].text
        # The output message has to be a user message, as our chat 
        # generator cannot work with just system/assistant messages
        return {
            "messages": [
                ChatMessage.from_user(f"Conversation so far (summary):\n{summary}")
            ]
        }

# skip_system_messages=False so the compacted summary (a system message) is persisted
message_store = InMemoryChatMessageStore(skip_system_messages=False)

pipeline = Pipeline()
pipeline.add_component("message_retriever", ChatMessageRetriever(message_store))
pipeline.add_component("compactor", HistoryCompactor(threshold=3))
pipeline.add_component(
    "agent", 
    Agent(
        chat_generator=AnthropicChatGenerator(model="claude-haiku-4-5-20251001"),
        system_prompt="You are a helpful assistant.",
        tools=[get_current_date],
    )
)
pipeline.add_component("message_writer", ChatMessageWriter(message_store))

pipeline.connect("message_retriever.messages", "compactor.messages")
pipeline.connect("compactor.messages", "agent.messages")
pipeline.connect("agent.messages", "message_writer.messages")

chat_history_id = "session_1"

# First turn
pipeline.run({
    "message_retriever": {
        "current_messages": [ChatMessage.from_user("What day is it today?")],
        "chat_history_id": chat_history_id,
    },
    "message_writer": {"chat_history_id": chat_history_id},
})

# Second turn - history is retrieved, compacted if needed, and stored back automatically
result = pipeline.run({
    "message_retriever": {
        "current_messages": [ChatMessage.from_user("What month are we in?")],
        "chat_history_id": chat_history_id,
    },
    "message_writer": {"chat_history_id": chat_history_id},
})
print(result["agent"]["last_message"].text)

Adding only relevant tools to the context

Tool definitions can be a surprisingly large slice of the context window, especially when connecting to MCP servers that expose dozens or hundreds of tools. Listing every tool upfront means the model receives all those descriptions on every single call, regardless of which tool is actually needed.

SearchableToolset, introduced in Haystack 2.25, inverts this approach. Instead of exposing the full catalog, the agent starts with a single search_tools function and uses it to dynamically discover relevant tools via BM25 keyword search. Only the tools it actually needs are loaded into the context for that turn.

from haystack.components.agents import Agent
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool, SearchableToolset

# Create a catalog of tools
catalog = [
    Tool(name="get_weather", description="Get weather for a city", ...),
    Tool(name="search_web", description="Search the web", ...),
    # ... 100s more tools
]
toolset = SearchableToolset(catalog=catalog)

agent = Agent(chat_generator=AnthropicChatGenerator(), tools=toolset)

# The agent is initially provided only with the search_tools tool
# and will use it to find relevant tools on demand.
result = agent.run(messages=[ChatMessage.from_user("What's the weather in Milan?")])

Offloading notes (scratchpad / working memory)

An agent's intermediate reasoning - the chain of thoughts it builds up while working through a multi-step task - does not have to live inside the context window. A simple alternative is to give the agent two dedicated tools: one to write a note to an external store, and one to read notes back. Instead of accumulating its internal monologue in the prompt, the agent can offload conclusions, partial results, and reminders to storage and retrieve only what it needs at each step.

This keeps the context lean: rather than carrying the full trace of every intermediate thought, the agent holds a minimal working state and queries its own notes on demand. The pattern is especially useful for long-horizon tasks where the reasoning chain would otherwise grow without bound, and it has the side effect of making the agent's thinking inspectable and debuggable from outside the model.

What's coming next in this series

This article is the foundation of a series on context engineering. Future posts will go deeper on specific topics - measuring whether your context actually helps the model, keeping context manageable in long-running agent loops, diversifying retrieval results, tracking token usage across pipelines, and more. If there is a particular area you would like us to cover first, let us know.

To stay up to date with the series and everything else happening in Haystack, star the Haystack GitHub repository and join the conversation on Discord.

What is Agentic RAG? Building Agents with Qdrant

Kacper Łukawski — Mon, 25 Nov 2024 18:19:49 +0000

Standard Retrieval Augmented Generation follows a predictable, linear path: receive a query, retrieve relevant documents, and generate a response. In many cases that might be enough to solve a particular problem. In the worst case scenario, your LLM will just decide to not answer the question, because the context does not provide enough information.

On the other hand, we have agents. These systems are given more freedom to act, and can take multiple non-linear steps to achieve a certain goal. There isn't a single definition of what an agent is, but in general, it is an applicationthat uses LLM and usually some tools to communicate with the outside world.

LLMs are used as decision-makers which decide what action to take next. Actions can be anything, but they are usually well-defined and limited to a certain set of possibilities. One of these actions might be to query a vector database, like Qdrant, to retrieve relevant documents, if the context is not enough to make a decision.

However, RAG is just a single tool in the agent's arsenal.

Agentic RAG: Combining RAG with Agents

Since the agent definition is vague, the concept of Agentic RAG is also not well-defined. In general, it refers to the combination of RAG with agents. This allows the agent to use external knowledge sources to make decisions, and primarily to decide when the external knowledge is needed.

We can describe a system as Agentic RAG if it breaks the
linear flow of a standard RAG system, and gives the agent the ability to take multiple steps to achieve a goal.

A simple router that chooses a path to follow is often described as the simplest form of an agent. Such a system has multiple paths with conditions describing when to take a certain path. In the context of Agentic RAG, the agent can decide to query a vector database if the context is not enough to answer, or skip the query if it's enough, or when the question refers to common knowledge.

Alternatively, there might be multiple collections storing different kinds of information, and the agent can decide which collection to query based on the context. The key factor is that the decision of choosing a path is made by the LLM, which is the core of the agent. A routing agent never comes back to the previous step, so it's ultimately just a conditional decision-making system.

However, routing is just the beginning. Agents can be much more complex, and extreme forms of agents can have completefreedom to act. In such cases, the agent is given a set of tools and can autonomously decide which ones to use, how to use them, and in which order. LLMs are asked to plan and execute actions, and the agent can take multiple steps to achieve a goal, including taking steps back if needed. Such a system does not have to follow a DAG structure (Directed Acyclic Graph), and can have loops that help to self-correct the decisions made in the past.

An agentic RAG system built in that manner can have tools not only to query a vector database, but also to play with the query, summarize the
results, or even generate new data to answer the question. Options are endless, but there are some common patterns that can be observed in the wild.

Solving Information Retrieval Problems with LLMs

Generally speaking, tools exposed in an agentic RAG system are used to solve information retrieval problems which are not new to the search community. LLMs have changed how we approach these problems, but the core of the problem remains the same. What kind of tools you can consider using in an agentic RAG? Here are some examples:

Querying a vector database - the most common tool used in agentic RAG systems. It allows the agent to retrieve relevant documents based on the query.
Query expansion - a tool that can be used to improve the query. It can be used to add synonyms, correct typos, or even to generate new queries based on the original one.

Extracting filters - vector search alone is sometimes not enough. In many cases, you might want to narrow down the results based on specific parameters. This extraction process can automatically identify relevant conditions from the query. Otherwise, your users would have to manually define these search constraints.

Quality judgement - knowing the quality of the results for given query can be used to decide whether they are good enough to answer, or if the agent should take another step to improve them somehow. Alternatively it can also admit the failure to provide good response.

These are just some of the examples, but the list is not exhaustive. For example, your LLM could possibly play with Qdrant search parameters or choose different methods to query it. An example? If your users are searching using some specific keywords, you may prefer sparse vectors to dense vectors, as they are more efficient in such cases. In that case you have to arm your agent with tools to decide when to use sparse vectors and when to use dense vectors. Agentaware of the collection structure can make such decisions easily.

Each of these tools might be a separate agent on its own, and multi-agent systems are not uncommon. In such cases, agents can communicate with each other, and one agent can decide to use another agent to solve a particular problem.

Pretty useful component of an agentic RAG is also a human in the loop, which can be used to correct the agent's decisions, or steer it in the right direction.

Where are Agents Used?

Agents are an interesting concept, but since they heavily rely on LLMs, they are not applicable to all problems. Using Large Language Models is expensive and tend to be slow, what in many cases, it's not worth the cost. Standard RAG involves just a single call to the LLM, and the response is generated in a predictable way. Agents, on the other hand, can take multiple steps, and the latency experienced by the user adds up.

In many cases, it's not acceptable. Agentic RAG is probably not that widely applicable in ecommerce search, where the user expects a quick response, but might be fine for customer support, where the user is willing to wait a bit longer for a better answer.

Which Framework is Best?

There are lots of frameworks available to build agents, and choosing the best one is not easy. It depends on your existing stack or the tools you are familiar with. Some of the most popular LLM libraries have already drifted towards the agent paradigm, and they are offering tools to build them. There are, however, some tools built primarily for agents development, so let's focus on them.

LangGraph

Developed by the LangChain team, LangGraph seems like a natural extension for those who already use LangChain for building their RAG systems, and would like to start with agentic RAG.

Surprisingly, LangGraph has nothing to do with Large Language Models on its own. It's a framework for building graph-based applications in which each node is a step of the workflow. Each node takes an application state as an input, and produces a modified state as an output. The state is then passed to the next node, and so on. Edges between the nodes might be conditional what makes branching possible. Contrary to some DAG-based tool (i.e. Apache Airflow), LangGraph allows for loops in the graph, which makes it possible to implement cyclic workflows, so an agent can achieve self-reflection and self-correction. Theoretically, LangGraph can be used to build any kind of applications in a graph-based manner, not only LLM agents.

Some of the strengths of LangGraph include:

Persistence - the state of the workflow graph is stored as a checkpoint. That happens at each so-called super-step (which is a single sequential node of a graph). It enables replying certain steps of the workflow, fault-tolerance, and including human-in-the-loop interactions. This mechanism also acts as a short-term memory, accessible in a context of a particular workflow execution.
Long-term memory - LangGraph also has a concept of memories that are shared between different workflow runs. However, this mechanism has to explicitly handled by our nodes. Qdrant with its semantic search capabilities is often used as a long-term memory layer.
Multi-agent support - while there is no separate concept of multi-agent systems in LangGraph, it's possible to create such an architecture by building a graph that includes multiple agents and some kind of supervisor that makes a decision which agent to use in a given situation. If a node might be anything, then it might be another agent as well.

Some other interesting features of LangGraph include the ability to visualize the graph, automate the retries of failed steps, and include human-in-the-loop interactions.

A minimal example of an agentic RAG could improve the user query, e.g. by fixing typos, expanding it with synonyms, or even generating a new query based on the original one. The agent could then retrieve documents from a vector database based on the improved query, and generate a response. The LangGraph app implementing this approach could look like this:

from typing import Sequence
from typing_extensions import TypedDict, Annotated
from langchain_core.messages import BaseMessage
from langgraph.constants import START, END
from langgraph.graph import add_messages, StateGraph


class AgentState(TypedDict):
    # The state of the agent includes at least the messages exchanged between the agent(s) 
    # and the user. It is, however, possible to include other information in the state, as 
    # it depends on the specific agent.
    messages: Annotated[Sequence[BaseMessage], add_messages]


def improve_query(state: AgentState):
    ...

def retrieve_documents(state: AgentState):
    ...

def generate_response(state: AgentState):
    ...

# Building a graph requires defining nodes and building the flow between them with edges.
builder = StateGraph(AgentState)

builder.add_node("improve_query", improve_query)
builder.add_node("retrieve_documents", retrieve_documents)
builder.add_node("generate_response", generate_response)

builder.add_edge(START, "improve_query")
builder.add_edge("improve_query", "retrieve_documents")
builder.add_edge("retrieve_documents", "generate_response")
builder.add_edge("generate_response", END)

# Compiling the graph performs some checks and prepares the graph for execution.
compiled_graph = builder.compile()

# Compiled graph might be invoked with the initial state to start.
compiled_graph.invoke({
    "messages": [
        ("user", "Why Qdrant is the best vector database out there?"),
    ]
})

Each node of the process is just a Python function that does certain operation. You can call an LLM of your choice inside of them, if you want to, but there is no assumption about the messages being created by any AI. LangGraph rather acts as a runtime that launches these functions in a specific order, and passes the state between them.

While LangGraph integrates well with the LangChain ecosystem, it can be used independently. For teams looking for additional support and features, there's also a commercial offering called LangGraph Platform. The framework is available for both Python and JavaScript environments, making it possible to be used in different tech stacks.

CrewAI

CrewAI is another popular choice for building agents, including agentic RAG. It's a high-level framework that assumesthere are some LLM-based agents working together to achieve a common goal. That's where the "crew" in CrewAI comes from. CrewAI is designed with multi-agent systems in mind. Contrary to LangGraph, the developer does not create a graph of
processing, but defines agents and their roles within the crew.

Some of the key concepts of CrewAI include:

Agent - a unit that has a specific role and goal, controlled by an LLM. It can optionally use some external tools to communicate with the outside world, but generally steered by prompt we provide to the LLM.
Process - currently either sequential or hierarchical. It defines how the task will be executed by the agents. In a sequential process, agents are executed one after another, while in a hierarchical process, agent is selected by the manager agent, which is responsible for making decisions about which agent to use in a given situation.
Roles and goals - each agent has a certain role within the crew, and the goal it should aim to achieve. These are set when we define an agent and are used to make decisions about which agent to use in a given situation.
Memory - an extensive memory system consists of short-term memory, long-term memory, entity memory, and contextual memory that combines the other three. There is also user memory for preferences and personalization. This is where Qdrant comes into play, as it might be used as a long-term memory layer.

CrewAI provides a rich set of tools integrated into the framework. That may be a huge advantage for those who want to combine RAG with e.g. code execution, or image generation. The ecosystem is rich, however brining your own tools is not a big deal, as CrewAI is designed to be extensible.

A simple agentic RAG application implemented in CrewAI could look like this:

from crewai import Crew, Agent, Task
from crewai.memory.entity.entity_memory import EntityMemory
from crewai.memory.short_term.short_term_memory import ShortTermMemory
from crewai.memory.storage.rag_storage import RAGStorage

class QdrantStorage(RAGStorage):
    ...

response_generator_agent = Agent(
    role="Generate response based on the conversation",
    goal="Provide the best response, or admit when the response is not available.",
    backstory=(
        "I am a response generator agent. I generate "
        "responses based on the conversation."
    ),
    verbose=True,
)

query_reformulation_agent = Agent(
    role="Reformulate the query",
    goal="Rewrite the query to get better results. Fix typos, grammar, word choice, etc.",
    backstory=(
        "I am a query reformulation agent. I reformulate the " 
        "query to get better results."
    ),
    verbose=True,
)

task = Task(
    description="Let me know why Qdrant is the best vector database out there.",
    expected_output="3 bullet points",
    agent=response_generator_agent,
)

crew = Crew(
    agents=[response_generator_agent, query_reformulation_agent],
    tasks=[task],
    memory=True,
    entity_memory=EntityMemory(storage=QdrantStorage("entity")),
    short_term_memory=ShortTermMemory(storage=QdrantStorage("short-term")),
)
crew.kickoff()

Disclaimer: QdrantStorage is not a part of the CrewAI framework, but it's taken from the Qdrant documentation on how to integrate Qdrant with CrewAI.

Although it's not a technical advantage, CrewAI has a great documentation. The framework is available for Python, and it's easy to get started with it. CrewAI also has a commercial offering, CrewAI Enterprise, which provides a platform for building and deploying agents at scale.

AutoGen

AutoGen emphasizes multi-agent architectures as a fundamental design principle. The framework requires at least two agents in any system to really call an application agentic - typically an assistant and a user proxy exchange messages to achieve a common goal. Sequential chat with more than two agents is also supported, as well as group chat and nested
chat for internal dialogue. However, AutoGen does not assume there is a structured state that is passed between the agents, and the chat conversation is the only way to communicate between them.

There are many interesting concepts in the framework, some of them even quite unique:

Tools/functions - external components that can be used by agents to communicate with the outside world. They are defined as Python callables, and can be used for any external interaction we want to allow the agent to do. Type annotations are used to define the input and output of the tools, and Pydantic models are supported for more complex type schema. AutoGen supports only OpenAI-compatible tool call API for the time being.
Code executors - built-in code executors include local command, Docker command, and Jupyter. An agent can write and launch code, so theoretically the agents can do anything that can be done in Python. None of the other frameworks made code generation and execution that prominent. Code execution being the first-class citizen in AutoGen is an interesting concept.

Each AutoGen agent uses at least one of the components: human-in-the-loop, code executor, tool executor, or LLM. A simple agentic RAG, based on the conversation of two agents which can retrieve documents from a vector database, or improve the query, could look like this:

from os import environ

from autogen import ConversableAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
from qdrant_client import QdrantClient

client = QdrantClient(...)

response_generator_agent = ConversableAgent(
    name="response_generator_agent",
    system_message=(
        "You answer user questions based solely on the provided context. You ask to retrieve relevant documents for "
        "your query, or reformulate the query, if it is incorrect in some way."
    ),
    description="A response generator agent that can answer your queries.",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": environ.get("OPENAI_API_KEY")}]},
    human_input_mode="NEVER",
)

user_proxy = RetrieveUserProxyAgent(
    name="retrieval_user",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": environ.get("OPENAI_API_KEY")}]},
    human_input_mode="NEVER",
    retrieve_config={
        "task": "qa",
        "chunk_token_size": 2000,
        "vector_db": "qdrant",
        "db_config": {"client": client},
        "get_or_create": True,
        "overwrite": True,
    },
)

result = user_proxy.initiate_chat(
    response_generator_agent,
    message=user_proxy.message_generator,
    problem="Why Qdrant is the best vector database out there?",
    max_turns=10,
)

For those new to agent development, AutoGen offers AutoGen Studio, a low-code interface for prototyping agents. While not intended for production use, it significantly lowers the barrier to entry for experimenting with agent architectures.

It's worth noting that AutoGen is currently undergoing significant updates, with version 0.4.x in development introducing substantial API changes compared to the stable 0.2.x release. While the framework currently has limited built-in persistence and state management capabilities, these features may evolve in future releases.

OpenAI Swarm

Unliked the other frameworks described in this article, OpenAI Swarm is an educational project, and it's not ready for production use. It's worth mentioning, though, as it's pretty lightweight and easy to get started with. OpenAI Swarm is an experimental framework for orchestrating multi-agent workflows that focuses on agent coordination through direct handoffs rather than complex orchestration patterns.

With that setup, agents are just exchanging messages in a chat, optionally calling some Python functions to communicate with external services, or handing off the conversation to another agent, if the other one seems to be more suitable to answer the question. Each agent has a certain role, defined by the instructions we have to define.

We have to decide which LLM will a particular agent use, and a set of functions it can call. For example, a retrieval agent could use a vector database to retrieve documents, and return the results to the next agent. That means, there should be a function that performs the semantic search on its behalf, but the model will decide how the query should look like.

Here is how a similar agentic RAG application, implemented in OpenAI Swarm, could look like:

from swarm import Swarm, Agent

client = Swarm()

def retrieve_documents(query: str) -> list[str]:
    """
    Retrieve documents based on the query.
    """
    ...

def transfer_to_query_improve_agent():
    return query_improve_agent

query_improve_agent = Agent(
    name="Query Improve Agent",
    instructions=(
        "You are a search expert that takes user queries and improves them to get better results. You fix typos and "
        "extend queries with synonyms, if needed. You never ask the user for more information."
    ),
)

response_generation_agent = Agent(
    name="Response Generation Agent",
    instructions=(
        "You take the whole conversation and generate a final response based on the chat history. "
        "If you don't have enough information, you can retrieve the documents from the knowledge base or "
        "reformulate the query by transferring to other agent. You never ask the user for more information. "
        "You have to always be the last participant of each conversation."
    ),
    functions=[retrieve_documents, transfer_to_query_improve_agent],
)

response = client.run(
    agent=response_generation_agent,
    messages=[
        {
            "role": "user",
            "content": "Why Qdrant is the best vector database out there?"
        }
    ],
)

Even though we don't explicitly define the graph of processing, the agents can still decide to hand off the processing to a different agent. There is no concept of a state, so everything relies on the messages exchanged between different components.

OpenAI Swarm does not focus on integration with external tools, and if you would like to integrate semantic search with Qdrant, you would have to implement it fully yourself. Obviously, the library is tightly coupled with OpenAI models, and while using some other ones is possible, it requires some additional work like setting up proxy that will
adjust the interface to OpenAI API.

The winner?

Choosing the best framework for your agentic RAG system depends on your existing stack, team expertise, and the specific requirements of your project. All the described tools are strong contenders, and they are developed at rapid pace. It's worth keeping an eye on all of them, as they are likely to evolve and improve over time. Eventually, you should be able to build the same processes with any of them, but some of them may be more suitable in a specific ecosystem of the tools you want your agent to interact with.

There are, however, some important factors to consider when choosing a framework for your agentic RAG system:

Human-in-the-loop - even though we aim to build autonomous agents, it's often important to include the feedback from the human, so our agents cannot perform malicious actions.
Observability - how easy it is to debug the system, and how easy it is to understand what's happening inside. Especially important, since we are dealing with lots of LLM prompts.

Still, choosing the right toolkit depends on the state of your project, and the specific requirements you have. If you want to integrate your agent with number of external tools, CrewAI might be the best choice, as the set of out-of-the-box integrations is the biggest. However, LangGraph integrates well with LangChain, so if you are familiar with that ecosystem, it may suit you better.

All the frameworks have different approaches to building agents, so it's worth experimenting with all of them to see which one fits your needs the best. LangGraph and CrewAI are more mature and have more features, while AutoGen and OpenAI Swarm are more lightweight and more experimental. However, none of the existing frameworks solves all the mentioned Information Retrieval problems, so you still have to build your own tools to fill the gaps.

Building Agentic RAG with Qdrant

No matter which framework you choose, Qdrant is a great tool to build agentic RAG systems. Please check out our integrations to choose the best one for your use case and preferences. The easiest way to start using Qdrant is to use our managed service, Qdrant Cloud. A free 1GB cluster is
available for free, so you can start building your agentic RAG system in minutes.

Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance!

Kacper Łukawski — Thu, 29 Aug 2024 15:59:51 +0000

* At least any open-source model, since you need access to its internals.

You Can Adapt Dense Embedding Models for Late Interaction

Qdrant 1.10 introduced support for multi-vector representations, with late interaction being a prominent example of this model. In essence, both documents and queries are represented by multiple vectors, and identifying the most relevant documents involves calculating a score based on the similarity between corresponding query and document embeddings. If you're not familiar with this paradigm, our updated Hybrid Search article explains how multi-vector representations can enhance retrieval quality.

Figure 1: We can visualize late interaction between corresponding document-query embedding pairs.

There are many specialized late interaction models, such as ColBERT, but it appears that regular dense embedding models can also be effectively utilized in this manner.

In this study, we will demonstrate that standard dense embedding models, traditionally used for single-vector representations, can be effectively adapted for late interaction scenarios using output token embeddings as multi-vector representations.

By testing out retrieval with Qdrant’s multi-vector feature, we will show that these models can rival or surpass specialized late interaction models in retrieval performance, while offering lower complexity and greater efficiency. This work redefines the potential of dense models in advanced search pipelines, presenting a new method for optimizing retrieval systems.

Understanding Embedding Models

The inner workings of embedding models might be surprising to some. The model doesn’t operate directly on the input text; instead, it requires a tokenization step to convert the text into a sequence of token identifiers. Each token identifier is then passed through an embedding layer, which transforms it into a dense vector. Essentially, the embedding layer acts as a lookup table that maps token identifiers to dense vectors. These vectors are then fed into the transformer model as input.

Figure 2: The tokenization step, which takes place before vectors are added to the transformer model.

The input token embeddings are context-free and are learned during the model’s training process. This means that each token always receives the same embedding, regardless of its position in the text. At this stage, the token embeddings are unaware of the context in which they appear. It is the transformer model’s role to contextualize these embeddings.

Much has been discussed about the role of attention in transformer models, but in essence, this mechanism is responsible for capturing cross-token relationships. Each transformer module takes a sequence of token embeddings as input and produces a sequence of output token embeddings. Both sequences are of the same length, with each token embedding being enriched by information from the other token embeddings at the current step.

Figure 3: The mechanism which produces a sequence of output token embeddings.

Figure 4: The final step performed by the embedding model is pooling the output token embeddings to generate a single vector representation of the input text.

There are several pooling strategies, but regardless of which one a model uses, the output is always a single vector representation, which inevitably loses some information about the input. It’s akin to giving someone detailed, step-by-step directions to the nearest grocery store versus simply pointing in the general direction. While the vague direction might suffice in some cases, the detailed instructions are more likely to lead to the desired outcome.

Using Output Token Embeddings for Multi-Vector Representations

We often overlook the output token embeddings, but the fact is—they also serve as multi-vector representations of the input text. So, why not explore their use in a multi-vector retrieval model, similar to late interaction models?

Experimental Findings

We conducted several experiments to determine whether output token embeddings could be effectively used in place of traditional late interaction models. The results are quite promising.

Dataset	Model	Experiment	NDCG@10
SciFact	`prithivida/Splade_PP_en_v1`	sparse vectors	0.70928
	`colbert-ir/colbertv2.0`	late interaction model	0.69579
	`all-MiniLM-L6-v2`	single dense vector representation	0.64508
	`all-MiniLM-L6-v2`	output token embeddings	0.70724
	`BAAI/bge-small-en`	single dense vector representation	0.68213
	`BAAI/bge-small-en`	output token embeddings	0.73696

NFCorpus	`prithivida/Splade_PP_en_v1`	sparse vectors	0.34166
	`colbert-ir/colbertv2.0`	late interaction model	0.35036
	`all-MiniLM-L6-v2`	single dense vector representation	0.31594
	`all-MiniLM-L6-v2`	output token embeddings	0.35779
	`BAAI/bge-small-en`	single dense vector representation	0.29696
	`BAAI/bge-small-en`	output token embeddings	0.37502

ArguAna	`prithivida/Splade_PP_en_v1`	sparse vectors	0.47271
	`colbert-ir/colbertv2.0`	late interaction model	0.44534
	`all-MiniLM-L6-v2`	single dense vector representation	0.50167
	`all-MiniLM-L6-v2`	output token embeddings	0.45997
	`BAAI/bge-small-en`	single dense vector representation	0.58857
	`BAAI/bge-small-en`	output token embeddings	0.57648

The source code for these experiments is open-source and utilizes beir-qdrant, an integration of Qdrant with the BeIR library. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.

Even the simple all-MiniLM-L6-v2 model can be applied in a late interaction model fashion, resulting in a positive impact on retrieval quality. However, the best results were achieved with the BAAI/bge-small-en model, which outperformed both sparse and late interaction models.

It's important to note that ColBERT has not been trained on BeIR datasets, making its performance fully out-of-domain. Nevertheless, the all-MiniLM-L6-v2 training dataset also lacks any BeIR data, yet it still performs remarkably well.

Comparative Analysis of Dense vs. Late Interaction Models

The retrieval quality speaks for itself, but there are other important factors to consider.

The traditional dense embedding models we tested are less complex than late interaction or sparse models. With fewer parameters, these models are expected to be faster during inference and more cost-effective to maintain. Below is a comparison of the models used in the experiments:

Model	Number of parameters
`prithivida/Splade_PP_en_v1`	109,514,298
`colbert-ir/colbertv2.0`	109,580,544
`BAAI/bge-small-en`	33,360,000
`all-MiniLM-L6-v2`	22,713,216

One argument against using output token embeddings is the increased storage requirements compared to ColBERT-like models. For instance, the all-MiniLM-L6-v2 model produces 384-dimensional output token embeddings, which is three times more than the 128-dimensional embeddings generated by ColBERT-like models. This increase not only leads to higher memory usage but also impacts the computational cost of retrieval, as calculating distances takes more time. Mitigating this issue through vector compression would make a lot of sense.

Exploring Quantization for Multi-Vector Representations

Binary quantization is generally more effective for high-dimensional vectors, making the all-MiniLM-L6-v2 model, with its relatively low-dimensional outputs, less ideal for this approach. However, scalar quantization appeared to be a viable alternative. The table below summarizes the impact of quantization on retrieval quality.

Dataset	Model	Experiment	NDCG@10
SciFact	`all-MiniLM-L6-v2`	output token embeddings	0.70724
SciFact	`all-MiniLM-L6-v2`	output token embeddings (uint8)	0.70297

NFCorpus	`all-MiniLM-L6-v2`	output token embeddings	0.35779
NFCorpus	`all-MiniLM-L6-v2`	output token embeddings (uint8)	0.35572

It’s important to note that quantization doesn’t always preserve retrieval quality at the same level, but in this case, scalar quantization appears to have minimal impact on retrieval performance. The effect is negligible, while the memory savings are substantial.

We managed to maintain the original quality while using four times less memory. Additionally, a quantized vector requires 384 bytes, compared to ColBERT’s 512 bytes. This results in a 25% reduction in memory usage, with retrieval quality remaining nearly unchanged.

Practical Application: Enhancing Retrieval with Dense Models

If you’re using one of the sentence transformer models, the output token embeddings are calculated by default. While a single vector representation is more efficient in terms of storage and computation, there’s no need to discard the output token embeddings. According to our experiments, these embeddings can significantly enhance retrieval quality. You can store both the single vector and the output token embeddings in Qdrant, using the single vector for the initial retrieval step and then reranking the results with the output token embeddings.

Figure 5: A single model pipeline that relies solely on the output token embeddings for reranking.

To demonstrate this concept, we implemented a simple reranking pipeline in Qdrant. This pipeline uses a dense embedding model for the initial oversampled retrieval and then relies solely on the output token embeddings for the reranking step.

Single Model Retrieval and Reranking Benchmarks

Our tests focused on using the same model for both retrieval and reranking. The reported metric is NDCG@10. In all tests, we applied an oversampling factor of 5x, meaning the retrieval step returned 50 results, which were then narrowed down to 10 during the reranking step. Below are the results for some of the BeIR datasets:

Dataset	`all-miniLM-L6-v2`		`BAAI/bge-small-en`
Dataset	dense embeddings only	dense + reranking	dense embeddings only	dense + reranking
SciFact	0.64508	0.70293	0.68213	0.73053
NFCorpus	0.31594	0.34297	0.29696	0.35996
ArguAna	0.50167	0.45378	0.58857	0.57302
Touche-2020	0.16904	0.19693	0.13055	0.19821
TREC-COVID	0.47246	0.6379	0.45788	0.53539
FiQA-2018	0.36867	0.41587	0.31091	0.39067

The source code for the benchmark is publicly available, and you can find it in the repository of the beir-qdrant package.

Overall, adding a reranking step using the same model typically improves retrieval quality. However, the quality of various late interaction models is often reported based on their reranking performance when BM25 is used for the initial retrieval. This experiment aimed to demonstrate how a single model can be effectively used for both retrieval and reranking, and the results are quite promising.

Now, let's explore how to implement this using the new Query API introduced in Qdrant 1.10.

Implementation Guide: Setting Up Qdrant for Late Interaction

The new Query API in Qdrant 1.10 enables the construction of even more complex retrieval pipelines. We can use the single vector created after pooling for the initial retrieval step and then rerank the results using the output token embeddings.

Assuming the collection is named my-collection and is configured to store two named vectors: dense-vector and output-token-embeddings, here’s how such a collection could be created in Qdrant:

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="my-collection",
    vectors_config={
        "dense-vector": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE,
        ),
        "output-token-embeddings": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        ),
    }
)

Both vectors are of the same size since they are produced by the same all-MiniLM-L6-v2 model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Now, instead of using the search API with just a single dense vector, we can create a reranking pipeline. First, we retrieve 50 results using the dense vector, and then we rerank them using the output token embeddings to obtain the top 10 results.

query = "What else can be done with just all-MiniLM-L6-v2 model?"

client.query_points(
    collection_name="my-collection",
    prefetch=[
        # Prefetch the dense embeddings of the top-50 documents
        models.Prefetch(
            query=model.encode(query).tolist(),
            using="dense-vector",
            limit=50,
        )
    ],
    # Rerank the top-50 documents retrieved by the dense embedding model
    # and return just the top-10. Please note we call the same model, but
    # we ask for the token embeddings by setting the output_value parameter.
    query=model.encode(query, output_value="token_embeddings").tolist(),
    using="output-token-embeddings",
    limit=10,
)

Try the Experiment Yourself

In a real-world scenario, you might take it a step further by first calculating the token embeddings and then performing pooling to obtain the single vector representation. This approach allows you to complete everything in a single pass.

The simplest way to start experimenting with building complex reranking pipelines in Qdrant is by using the forever-free cluster on Qdrant Cloud and reading Qdrant's documentation.

The source code for these experiments is open-source and uses beir-qdrant, an integration of Qdrant with the BeIR library.

Future Directions and Research Opportunities

The initial experiments using output token embeddings in the retrieval process have yielded promising results. However, we plan to conduct further benchmarks to validate these findings and explore the incorporation of sparse methods for the initial retrieval. Additionally, we aim to investigate the impact of quantization on multi-vector representations and its effects on retrieval quality. Finally, we will assess retrieval speed, a crucial factor for many applications.

What is Hybrid Search?

Kacper Łukawski — Tue, 06 Feb 2024 15:33:52 +0000

There is not a single definition of hybrid search. Actually, if we use more than one search algorithm, it might be described as some sort of hybrid. Some of the most popular definitions are:

A combination of vector search with attribute filtering. We won't dive much into details, as we like to call it just filtered vector search.
Vector search with keyword-based search. This one is covered in this article.
A mix of dense and sparse vectors. That strategy will be covered in the upcoming article.

Why do we still need keyword search?

A keyword-based search was the obvious choice for search engines in the past. It struggled with some common issues, but since we didn't have any alternatives, we had to overcome them with additional preprocessing of the documents and queries.

Vector search turned out to be a breakthrough, as it has some clear advantages in the following scenarios:

🌍 Multi-lingual & multi-modal search
🤔 For short texts with typos and ambiguous content-dependent meanings
👨‍🔬 Specialized domains with tuned encoder models
📄 Document-as-a-Query similarity search

It doesn't mean we do not keyword search anymore. There are also some cases in which this kind of method might be useful:

🌐💭 Out-of-domain search. Words are just words, no matter what they mean. BM25 ranking represents the universal property of the natural language - less frequent words are more important, as they carry most of the meaning.
⌨️💨 Search-as-you-type, when there are only a few characters types in, and we cannot use vector search yet.
🎯🔍 Exact phrase matching when we want to find the occurrences of a specific term in the documents. That's especially useful for names of the products, people, part numbers, etc.

Matching the tool to the task

There are various cases in which we need search capabilities and each of those cases will have some different requirements. Therefore, there is not just one strategy to rule them all, and some different tools may fit us better. Text search itself might be roughly divided into multiple specializations like:

Web-scale search - documents retrieval
Fast search-as-you-type
Search over less-than-natural texts (logs, transactions, code, etc.)

Each of those scenarios has a specific tool, which performs better for that specific use case. If you already expose search capabilities, then you probably have one of them in your tech stack. And we can easily combine those tools with vector search to get the best of both worlds.

The fast search: A Fallback strategy

The easiest way to incorporate vector search into the existing stack is to treat it as some sort of fallback strategy. So whenever your keyword search struggle with finding proper results, you can run a semantic search to extend the results. That is especially important in cases like search-as-you-type in which a new query is fired every single time your user types the next character in.

For such cases the speed of the search is crucial. Therefore, we can't use vector search on every query. At the same
time, the simple prefix search might have a bad recall.

In this case, a good strategy is to use vector search only when the keyword/prefix search returns none or just a small number of results. A good candidate for this is MeiliSearch.
It uses custom ranking rules to provide results as fast as the user can type.

The pseudocode of such strategy may go as following:

async def search(query: str):
    # Get fast results from MeiliSearch
    keyword_search_result = search_meili(query)

    # Check if there are enough results
    # or if the results are good enough for given query
    if are_results_enough(keyword_search_result, query):
        return keyword_search

    # Encoding takes time, but we get more results
    vector_query = encode(query)

    vector_result = search_qdrant(vector_query)
    return vector_result

The precise search: The re-ranking strategy

In the case of document retrieval, we care more about the search result quality and time is not a huge constraint.
There is a bunch of search engines that specialize in the full-text search we found interesting:

Tantivy - a full-text indexing library written in Rust. Has a great performance and featureset.
lnx - a young but promising project, utilizes Tanitvy as a backend.
ZincSearch - a project written in Go, focused on minimal resource usage and high performance.
Sonic - a project written in Rust, uses custom network communication protocol for fast communication between the client and the server.

All of those engines might be easily used in combination with the vector search offered by Qdrant. But the exact way how to combine the results of both algorithms to achieve the best search precision might be still unclear. So we need to understand how to do it effectively. We will be using reference datasets to benchmark the search quality.

Why not linear combination?

It's often proposed to use full-text and vector search scores to form a linear combination formula to rerank the results. So it goes like this:

final_score = 0.7 * vector_score + 0.3 * full_text_score

However, we didn't even consider such a setup. Why? Those scores don't make the problem linearly separable. We used BM25 score along with cosine vector similarity to use both of them as points coordinates in 2-dimensional space. The chart shows how those points are distributed:

A distribution of both Qdrant and BM25 scores mapped into 2D space. It clearly shows relevant and non-relevant objects are not linearly separable in that space, so using a linear combination of both scores won't give us a proper hybrid search.

Both relevant and non-relevant items are mixed. None of the linear formulas would be able to distinguish between them. Thus, that's not the way to solve it.

How to approach re-ranking?

There is a common approach to re-rank the search results with a model that takes some additional factors into account. Those models are usually trained on clickstream data of a real application and tend to be very business-specific. Thus, we'll not cover them right now, as there is a more general approach. We will
use so-called cross-encoder models.

Cross-encoder takes a pair of texts and predicts the similarity of them. Unlike embedding models, cross-encoders do not compress text into vector, but uses interactions between individual tokens of both texts. In general, they are more powerful than both BM25 and vector search, but they are also way slower. That makes it feasible to use cross-encoders only for re-ranking of some preselected candidates.

This is how a pseudocode for that strategy look like:

async def search(query: str):
    keyword_search = search_keyword(query)
    vector_search = search_qdrant(query) 
    all_results = await asyncio.gather(keyword_search, vector_search)  # parallel calls
    rescored = cross_encoder_rescore(query, all_results)
    return rescored

It is worth mentioning that queries to keyword search and vector search and re-scoring can be done in parallel.
Cross-encoder can start scoring results as soon as the fastest search engine returns the results.

Experiments

For that benchmark, there have been 3 experiments conducted:

Vector search with Qdrant

All the documents and queries are vectorized with all-MiniLM-L6-v2
model, and compared with cosine similarity.

Keyword-based search with BM25

All the documents are indexed by BM25 and queried with its default configuration.

Vector and keyword-based candidates generation and cross-encoder reranking

Both Qdrant and BM25 provides N candidates each and
ms-marco-MiniLM-L-6-v2 cross encoder performs reranking
on those candidates only. This is an approach that makes it possible to use the power of semantic and keyword based
search together.

Quality metrics

There are various ways of how to measure the performance of search engines, and Recommender Systems: Machine Learning Metrics and Business Metrics is a great introduction to that topic.
I selected the following ones:

NDCG@5, NDCG@10
DCG@5, DCG@10
MRR@5, MRR@10
Precision@5, Precision@10
Recall@5, Recall@10

Since both systems return a score for each result, we could use DCG and NDCG metrics. However, BM25 scores are not normalized be default. We performed the normalization to a range [0, 1] by dividing each score by the maximum score returned for that query.

Datasets

There are various benchmarks for search relevance available. Full-text search has been a strong baseline for most of them. However, there are also cases in which semantic search works better by default. For that article, I'm performing zero shot search, meaning our models didn't have any prior exposure to the benchmark datasets, so this is effectively an out-of-domain search.

Home Depot

Home Depot dataset consists of real inventory and search queries from Home Depot's website with a relevancy score from 1 (not relevant) to 3 (highly relevant).

Anna Montoya, RG, Will Cukierski. (2016). Home Depot Product Search Relevance. Kaggle. 
https://kaggle.com/competitions/home-depot-product-search-relevance

There are over 124k products with textual descriptions in the dataset and around 74k search queries with the relevancy
score assigned. For the purposes of our benchmark, relevancy scores were also normalized.

WANDS

I also selected a relatively new search relevance dataset. WANDS, which stands for Wayfair ANnotation Dataset, is designed to evaluate search engines for e-commerce.

WANDS: Dataset for Product Search Relevance Assessment
Yan Chen, Shujian Liu, Zheng Liu, Weiyi Sun, Linas Baltrunas and Benjamin Schroeder

In a nutshell, the dataset consists of products, queries and human annotated relevancy labels. Each product has various textual attributes, as well as facets. The relevancy is provided as textual labels: “Exact”, “Partial” and “Irrelevant” and authors suggest to convert those to 1, 0.5 and 0.0 respectively. There are 488 queries with a varying number of relevant items each.

The results

Both datasets have been evaluated with the same experiments. The achieved performance is shown in the tables.

Home Depot

The results achieved with BM25 alone are better than with Qdrant only. However, if we combine both methods into hybrid search with an additional cross encoder as a last step, then that gives great improvement over any baseline method.

With the cross-encoder approach, Qdrant retrieved about 56.05% of the relevant items on average, while BM25 fetched 59.16%. Those numbers don't sum up to 100%, because some items were returned by both systems.

WANDS

The dataset seems to be more suited for semantic search, but the results might be also improved if we decide to use a hybrid search approach with cross encoder model as a final step.

Overall, combining both full-text and semantic search with an additional reranking step seems to be a good idea, as we are able to benefit the advantages of both methods.

Again, it's worth mentioning that with the 3rd experiment, with cross-encoder reranking, Qdrant returned more than 48.12% of the relevant items and BM25 around 66.66%.

Some anecdotal observations

None of the algorithms works better in all the cases. There might be some specific queries in which keyword-based search will be a winner and the other way around. The table shows some interesting examples we could find in WANDS dataset during the experiments:

Query	BM25 Search	Vector Search
cybersport desk	desk ❌	gaming desk ✅
plates for icecream	"eat" plates on wood wall décor ❌	alicyn 8.5 '' melamine dessert plate ✅
kitchen table with a thick board	craft kitchen acacia wood cutting board ❌	industrial solid wood dining table ✅
wooden bedside table	30 '' bedside table lamp ❌	portable bedside end table ✅

Also examples where keyword-based search did better:

Query	BM25 Search	Vector Search
computer chair	vibrant computer task chair ✅	office chair ❌
64.2 inch console table	cervantez 64.2 '' console table ✅	69.5 '' console table ❌

A wrap up

Each search scenario requires a specialized tool to achieve the best results possible. Still, combining multiple tools with minimal overhead is possible to improve the search precision even further. Introducing vector search into an existing search stack doesn't need to be a revolution but just one small step at a time.

You'll never cover all the possible queries with a list of synonyms, so a full-text search may not find all the relevant documents. There are also some cases in which your users use different terminology than the one you have in your database.

Those problems are easily solvable with neural vector embeddings, and combining both approaches with an additional reranking step is possible. So you don't need to resign from your well-known full-text search mechanism but extend it with vector search to support the queries you haven't foreseen.

DEV Community: Kacper Łukawski

Context Engineering for Agentic Systems: What Goes Into Your Agent's Mind

Why context is so important for agentic systems

When less is more

The cost dimension

What fills the context window in an agentic system

What the context actually looks like

Building a Haystack agent

Strategies for managing context growth

Delegation to subagents

Improving retrieval quality

Summarisation and compaction

Adding only relevant tools to the context

Offloading notes (scratchpad / working memory)

What's coming next in this series

What is Agentic RAG? Building Agents with Qdrant

Agentic RAG: Combining RAG with Agents

Solving Information Retrieval Problems with LLMs

Where are Agents Used?

Which Framework is Best?

LangGraph

CrewAI

AutoGen

OpenAI Swarm

The winner?

Building Agentic RAG with Qdrant

Further Reading

Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance!

You Can Adapt Dense Embedding Models for Late Interaction

Understanding Embedding Models

Using Output Token Embeddings for Multi-Vector Representations

Experimental Findings

Comparative Analysis of Dense vs. Late Interaction Models

Exploring Quantization for Multi-Vector Representations

Practical Application: Enhancing Retrieval with Dense Models

Single Model Retrieval and Reranking Benchmarks

Implementation Guide: Setting Up Qdrant for Late Interaction

Try the Experiment Yourself

Future Directions and Research Opportunities

What is Hybrid Search?

Why do we still need keyword search?

Matching the tool to the task

The fast search: A Fallback strategy

The precise search: The re-ranking strategy

Why not linear combination?

How to approach re-ranking?

Experiments

Quality metrics

Datasets

Home Depot

WANDS

The results

Home Depot

WANDS

Some anecdotal observations

A wrap up