<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kacper Łukawski</title>
    <description>The latest articles on DEV Community by Kacper Łukawski (@kacperlukawski).</description>
    <link>https://dev.to/kacperlukawski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1274658%2F66be4aa5-e484-4a2c-b9a4-383273c9b1ef.jpeg</url>
      <title>DEV Community: Kacper Łukawski</title>
      <link>https://dev.to/kacperlukawski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kacperlukawski"/>
    <language>en</language>
    <item>
      <title>What is Agentic RAG? Building Agents with Qdrant</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Mon, 25 Nov 2024 18:19:49 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-agentic-rag-building-agents-with-qdrant-2pm0</link>
      <guid>https://dev.to/qdrant/what-is-agentic-rag-building-agents-with-qdrant-2pm0</guid>
      <description>&lt;p&gt;Standard &lt;a href="https://dev.to/articles/what-is-rag-in-ai/"&gt;Retrieval Augmented Generation&lt;/a&gt; follows a predictable, linear path: receive a query, retrieve relevant documents, and generate a response. In many cases that might be enough to solve a particular problem. In the worst case scenario, your LLM will just decide to not answer the question, because the context does not provide enough information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryfoc1hqtjy5u2lkplot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryfoc1hqtjy5u2lkplot.png" alt="Standard, linear RAG pipeline" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, we have agents. These systems are given more freedom to act, and can take multiple non-linear steps to achieve a certain goal. There isn't a single definition of what an agent is, but in general, it is an application that uses an LLM, and usually some tools, to communicate with the outside world.&lt;/p&gt;

&lt;p&gt;LLMs are used as decision-makers that decide what action to take next. Actions can be anything, but they are usually well-defined and limited to a certain set of possibilities. One of these actions might be to query a vector database, like Qdrant, to retrieve relevant documents if the context is not enough to make a decision.&lt;/p&gt;

&lt;p&gt;However, RAG is just a single tool in the agent's arsenal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1acyiwpo6m8csvlajuy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1acyiwpo6m8csvlajuy1.png" alt="AI Agent" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic RAG: Combining RAG with Agents
&lt;/h2&gt;

&lt;p&gt;Since the agent definition is vague, the concept of &lt;strong&gt;Agentic RAG&lt;/strong&gt; is also not well-defined. In general, it refers to the combination of RAG with agents. This allows the agent to use external knowledge sources to make decisions, and primarily to decide when the external knowledge is needed. &lt;/p&gt;

&lt;p&gt;We can describe a system as Agentic RAG if it breaks the linear flow of a standard RAG system, and gives the agent the ability to take multiple steps to achieve a goal.&lt;/p&gt;

&lt;p&gt;A simple router that chooses a path to follow is often described as the simplest form of an agent. Such a system has multiple paths, with conditions describing when to take each one. In the context of Agentic RAG, the agent can decide to query a vector database if the existing context is not enough to answer, or to skip the query when the context is sufficient or when the question refers to common knowledge.&lt;/p&gt;

&lt;p&gt;Alternatively, there might be multiple collections storing different kinds of information, and the agent can decide which collection to query based on the context. The key factor is that the decision of choosing a path is made by the LLM, which is the core of the agent. A routing agent never comes back to the previous step, so it's ultimately just a conditional decision-making system.&lt;/p&gt;
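The routing pattern described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a real implementation: `llm_choose_route` is a hypothetical stand-in for an actual LLM call, replaced here by a keyword heuristic so the example is runnable.

```python
# Minimal routing-agent sketch. In a real system, llm_choose_route would
# prompt an LLM to pick a path; here a keyword heuristic stands in for it.

def llm_choose_route(query: str) -> str:
    """Decide whether the query needs retrieval or can be answered directly."""
    needs_context = any(kw in query.lower() for kw in ("qdrant", "our docs", "internal"))
    return "retrieve" if needs_context else "answer_directly"

def retrieve_then_answer(query: str) -> str:
    # Placeholder for: query the vector database, then generate a response.
    return f"answer based on retrieved documents for: {query}"

def answer_directly(query: str) -> str:
    # Placeholder for: answer from the LLM's own knowledge.
    return f"direct answer for: {query}"

ROUTES = {"retrieve": retrieve_then_answer, "answer_directly": answer_directly}

def routing_agent(query: str) -> str:
    # The router picks a path exactly once and never revisits the decision.
    return ROUTES[llm_choose_route(query)](query)
```

Note that the router is a pure conditional dispatch: once a branch is taken, control never flows back, which is exactly why routing alone is only the simplest form of an agent.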

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5b7o9qmvo1a8wy6rvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5b7o9qmvo1a8wy6rvh.png" alt="Routing Agent" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, routing is just the beginning. Agents can be much more complex, and extreme forms of agents can have complete freedom to act. In such cases, the agent is given a set of tools and can autonomously decide which ones to use, how to use them, and in which order. LLMs are asked to plan and execute actions, and the agent can take multiple steps to achieve a goal, including taking steps back if needed. Such a system does not have to follow a DAG structure (Directed Acyclic Graph), and can have loops that help to self-correct decisions made in the past. &lt;/p&gt;

&lt;p&gt;An agentic RAG system built in that manner can have tools not only to query a vector database, but also to play with the query, summarize the results, or even generate new data to answer the question. The options are endless, but there are some common patterns that can be observed in the wild. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie282zphjgn5nffkgkyn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie282zphjgn5nffkgkyn.png" alt="Autonomous Agent" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solving Information Retrieval Problems with LLMs
&lt;/h3&gt;

&lt;p&gt;Generally speaking, tools exposed in an agentic RAG system are used to solve information retrieval problems which are not new to the search community. LLMs have changed how we approach these problems, but the core of the problem remains the same. What kinds of tools can you consider using in an agentic RAG system? Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Querying a vector database&lt;/strong&gt; - the most common tool used in agentic RAG systems. It allows the agent to retrieve relevant documents based on the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query expansion&lt;/strong&gt; - a tool that can be used to improve the query. It can be used to add synonyms, correct typos, or even to generate new queries based on the original one.&lt;/li&gt;
&lt;/ul&gt;
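Query expansion can be sketched as a function that returns the original query plus variants. In practice an LLM would propose the rewrites; the small synonym table below is a hypothetical stand-in so the example is runnable.

```python
# Query-expansion sketch: derive alternative queries from the original one.
# The synonym table is illustrative; an LLM would normally generate variants.
SYNONYMS = {
    "laptop": ["notebook"],
    "cheap": ["affordable", "budget"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query first, followed by synonym-substituted variants."""
    variants = [query]
    for word, alternatives in SYNONYMS.items():
        if word in query:
            variants.extend(query.replace(word, alt) for alt in alternatives)
    return variants
```

Each variant can then be searched in parallel and the results merged, which often recovers documents the original phrasing would have missed.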

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s73cptbqh6xbol73lod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s73cptbqh6xbol73lod.png" alt="Query expansion example" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extracting filters&lt;/strong&gt; - vector search alone is sometimes not enough. In many cases, you might want to narrow down the results based on specific parameters. This extraction process can automatically identify relevant conditions from the query. Otherwise, your users would have to manually define these search constraints.&lt;/li&gt;
&lt;/ul&gt;
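Filter extraction can be sketched as pulling structured constraints out of free text. A real system would typically use an LLM with function calling; a regex stands in here, and the output only mimics the general shape of a Qdrant-style filter rather than using the actual client models.

```python
import re

def extract_filters(query: str) -> dict:
    """Extract a price ceiling like 'under $100' into a filter-like condition."""
    conditions = []
    match = re.search(r"under \$(\d+)", query)
    if match:
        # Shape loosely modeled on a vector-database range filter.
        conditions.append({"key": "price", "range": {"lt": float(match.group(1))}})
    return {"must": conditions}
```

The extracted conditions would then be attached to the vector search request, sparing users from filling in filter widgets by hand.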

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyonl9p8qivfuqwtg3bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyonl9p8qivfuqwtg3bx.png" alt="Extracting filters" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality judgement&lt;/strong&gt; - knowing the quality of the results for a given query can be used to decide whether they are good enough to answer with, or whether the agent should take another step to improve them somehow. Alternatively, it can also admit failure to provide a good response.&lt;/li&gt;
&lt;/ul&gt;
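A quality judgement step can be sketched as a scoring function plus a decision rule. A real system would ask an LLM to judge relevance; plain term overlap stands in here, and the 0.5 threshold and 3-attempt cap are arbitrary illustrative choices.

```python
def judge_quality(query: str, documents: list[str]) -> float:
    """Fraction of query terms that appear in at least one retrieved document."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    covered = {t for t in terms if any(t in doc.lower() for doc in documents)}
    return len(covered) / len(terms)

def next_action(score: float, attempts: int, max_attempts: int = 3) -> str:
    """Answer if results look good; retry while budget remains; else admit failure."""
    if score >= 0.5:
        return "answer"
    return "retry" if attempts < max_attempts else "admit_failure"
```

The "admit_failure" branch matters: an agent that can say it does not know is usually preferable to one that hallucinates an answer from weak context.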

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpzgubn4ecxd66uvn81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpzgubn4ecxd66uvn81.png" alt="Quality judgement" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are just some examples; the list is not exhaustive. For instance, your LLM could play with Qdrant search parameters or choose different methods to query it. An example? If your users search using specific keywords, you may prefer sparse vectors over dense vectors, as they are more efficient in such cases. In that case, you have to arm your agent with tools to decide when to use sparse vectors and when to use dense ones. An agent aware of the collection structure can make such decisions easily.&lt;/p&gt;

&lt;p&gt;Each of these tools might be a separate agent on its own, and multi-agent systems are not uncommon. In such cases, agents can communicate with each other, and one agent can decide to use another agent to solve a particular problem.&lt;/p&gt;

&lt;p&gt;A pretty useful component of an agentic RAG system is also a human in the loop, who can correct the agent's decisions or steer it in the right direction.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where are Agents Used?
&lt;/h2&gt;

&lt;p&gt;Agents are an interesting concept, but since they heavily rely on LLMs, they are not applicable to all problems. Using Large Language Models is expensive, and they tend to be slow, so in many cases it's not worth the cost. Standard RAG involves just a single call to the LLM, and the response is generated in a predictable way. Agents, on the other hand, can take multiple steps, and the latency experienced by the user adds up. &lt;/p&gt;

&lt;p&gt;In many cases, it's not acceptable. Agentic RAG is probably not that widely applicable in ecommerce search, where the user expects a quick response, but might be fine for customer support, where the user is willing to wait a bit longer for a better answer.&lt;/p&gt;
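A back-of-the-envelope calculation shows how the latency compounds. The per-step timings below are assumptions for illustration only, not benchmarks.

```python
# Assumed timings (illustrative, not measured).
llm_call = 2.0       # seconds per LLM call
vector_search = 0.1  # seconds per vector-database query

# Standard RAG: one retrieval, one generation.
standard_rag = vector_search + llm_call

# A hypothetical agent: route, rewrite, judge, and answer calls (4 LLM calls),
# plus two retrieval rounds because the first one was judged insufficient.
agentic_rag = 4 * llm_call + 2 * vector_search

print(f"standard RAG: {standard_rag:.1f}s, agentic RAG: {agentic_rag:.1f}s")
```

Under these assumptions the agentic pipeline takes roughly four times longer, which is tolerable in a support chat but rarely in e-commerce search.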
&lt;h2&gt;
  
  
  Which Framework is Best?
&lt;/h2&gt;

&lt;p&gt;There are lots of frameworks available for building agents, and choosing the best one is not easy. It depends on your existing stack or the tools you are familiar with. Some of the most popular LLM libraries have already drifted towards the agent paradigm and offer tools to build agents. There are, however, some tools built primarily for agent development, so let's focus on them.&lt;/p&gt;
&lt;h3&gt;
  
  
  LangGraph
&lt;/h3&gt;

&lt;p&gt;Developed by the LangChain team, LangGraph seems like a natural extension for those who already use LangChain for building their RAG systems, and would like to start with agentic RAG. &lt;/p&gt;

&lt;p&gt;Surprisingly, LangGraph has nothing to do with Large Language Models on its own. It's a framework for building graph-based applications in which each &lt;strong&gt;node&lt;/strong&gt; is a step of the workflow. Each node takes an application &lt;strong&gt;state&lt;/strong&gt; as an input, and produces a modified state as an output. The state is then passed to the next node, and so on. &lt;strong&gt;Edges&lt;/strong&gt; between the nodes might be conditional, which makes branching possible. Contrary to some DAG-based tools (e.g. Apache Airflow), LangGraph allows for loops in the graph, which makes it possible to implement cyclic workflows, so an agent can achieve self-reflection and self-correction. Theoretically, LangGraph can be used to build any kind of application in a graph-based manner, not only LLM agents.&lt;/p&gt;
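The cyclic, self-correcting workflow described above can be sketched in plain Python, without any LangGraph APIs. All three node functions below are hypothetical stubs; a real system would back them with an LLM and a vector database, and LangGraph would express the loop as a conditional edge pointing back to an earlier node.

```python
# Plain-Python sketch of a cyclic RAG workflow: retry retrieval with a
# reformulated query until the results are judged good enough.

def retrieve(query: str) -> list[str]:
    # Placeholder for a vector-database query.
    return ["doc about " + query]

def judge(query: str, docs: list[str]) -> bool:
    # Placeholder for an LLM-based relevance check.
    return "qdrant" in query.lower()

def reformulate(query: str) -> str:
    # Placeholder for an LLM rewriting the query.
    return query + " qdrant"

def self_correcting_rag(query: str, max_loops: int = 3) -> list[str]:
    docs: list[str] = []
    for _ in range(max_loops):
        docs = retrieve(query)
        if judge(query, docs):       # conditional edge: finish...
            return docs
        query = reformulate(query)   # ...or loop back with a new query
    return docs                      # give up after the loop budget is spent
```

The `max_loops` cap is the crucial safety valve: without it, a cyclic workflow can spin indefinitely.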

&lt;p&gt;Some of the strengths of LangGraph include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; - the state of the workflow graph is stored as a checkpoint. That happens at each so-called super-step (which is a single sequential node of a graph). It enables replaying certain steps of the workflow, fault-tolerance, and human-in-the-loop interactions. This mechanism also acts as a &lt;strong&gt;short-term memory&lt;/strong&gt;, accessible in the context of a particular workflow execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; - LangGraph also has a concept of memories that are shared between different workflow runs. However, this mechanism has to be explicitly handled by our nodes. &lt;strong&gt;Qdrant with its semantic search capabilities is often used as a long-term memory layer&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent support&lt;/strong&gt; - while there is no separate concept of multi-agent systems in LangGraph, it's possible to create such an architecture by building a graph that includes multiple agents and some kind of supervisor that makes a decision which agent to use in a given situation. If a node might be anything, then it might be another agent as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some other interesting features of LangGraph include the ability to visualize the graph, automate the retries of failed steps, and include human-in-the-loop interactions.&lt;/p&gt;

&lt;p&gt;A minimal example of an agentic RAG could improve the user query, e.g. by fixing typos, expanding it with synonyms, or even generating a new query based on the original one. The agent could then retrieve documents from a vector database based on the improved query, and generate a response. The LangGraph app implementing this approach could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing_extensions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.constants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The state of the agent includes at least the messages exchanged between the agent(s) 
&lt;/span&gt;    &lt;span class="c1"&gt;# and the user. It is, however, possible to include other information in the state, as 
&lt;/span&gt;    &lt;span class="c1"&gt;# it depends on the specific agent.
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;improve_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# Building a graph requires defining nodes and building the flow between them with edges.
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;improve_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compiling the graph performs some checks and prepares the graph for execution.
&lt;/span&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Compiled graph might be invoked with the initial state to start.
&lt;/span&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node of the process is just a Python function that performs a certain operation. You can call an LLM of your choice inside them, if you want to, but there is no assumption about the messages being created by any AI. &lt;strong&gt;LangGraph rather acts as a runtime that launches these functions in a specific order, and passes the state between them&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;While &lt;a href="https://www.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; integrates well with the LangChain ecosystem, it can be used independently. For teams looking for additional support and features, there's also a commercial offering called LangGraph Platform. The framework is available for both Python and JavaScript environments, making it usable in different tech stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  CrewAI
&lt;/h3&gt;

&lt;p&gt;CrewAI is another popular choice for building agents, including agentic RAG. It's a high-level framework that assumes there are some LLM-based agents working together to achieve a common goal. That's where the "crew" in CrewAI comes from. CrewAI is designed with multi-agent systems in mind. Contrary to LangGraph, the developer does not create a graph of processing, but defines agents and their roles within the crew.&lt;/p&gt;

&lt;p&gt;Some of the key concepts of CrewAI include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; - a unit that has a specific role and goal, controlled by an LLM. It can optionally use some external tools to communicate with the outside world, but it is generally steered by the prompt we provide to the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process&lt;/strong&gt; - currently either sequential or hierarchical. It defines how tasks will be executed by the agents. In a sequential process, agents are executed one after another, while in a hierarchical process, an agent is selected by a manager agent, which is responsible for deciding which agent to use in a given situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roles and goals&lt;/strong&gt; - each agent has a certain role within the crew, and the goal it should aim to achieve. These are set when we define an agent and are used to make decisions about which agent to use in a given situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; - an extensive memory system consisting of short-term memory, long-term memory, entity memory, and contextual memory that combines the other three. There is also user memory for preferences and personalization. &lt;strong&gt;This is where Qdrant comes into play, as it might be used as a long-term memory layer.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrewAI provides a rich set of tools integrated into the framework. That may be a huge advantage for those who want to combine RAG with e.g. code execution or image generation. The ecosystem is rich; however, bringing your own tools is not a big deal, as CrewAI is designed to be extensible.&lt;/p&gt;

&lt;p&gt;A simple agentic RAG application implemented in CrewAI could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.entity.entity_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EntityMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.short_term.short_term_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ShortTermMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.storage.rag_storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGStorage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAGStorage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;response_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate response based on the conversation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide the best response, or admit when the response is not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a response generator agent. I generate &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;responses based on the conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query_reformulation_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reformulate the query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rewrite the query to get better results. Fix typos, grammar, word choice, etc.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a query reformulation agent. I reformulate the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query to get better results.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Let me know why Qdrant is the best vector database out there.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 bullet points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_reformulation_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entity_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EntityMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;short_term_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ShortTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short-term&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Disclaimer: QdrantStorage is not a part of the CrewAI framework, but it's taken from the Qdrant documentation on &lt;a href="https://qdrant.tech/documentation/frameworks/crewai/" rel="noopener noreferrer"&gt;how to integrate Qdrant with CrewAI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Although it's not a technical advantage, CrewAI has &lt;a href="https://docs.crewai.com/introduction" rel="noopener noreferrer"&gt;great documentation&lt;/a&gt;. The framework is available for Python, and it's easy to get started with. CrewAI also has a commercial offering, CrewAI Enterprise, which provides a platform for building and deploying agents at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoGen
&lt;/h3&gt;

&lt;p&gt;AutoGen emphasizes multi-agent architectures as a fundamental design principle. The framework requires at least two agents before it considers an application truly agentic - typically an assistant and a user proxy exchanging messages to achieve a common goal. Sequential chats with more than two agents are also supported, as are group chats and nested chats for internal dialogue. However, AutoGen does not assume a structured state is passed between the agents; the chat conversation is the only way for them to communicate.&lt;/p&gt;

&lt;p&gt;There are many interesting concepts in the framework, some of them even quite unique:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools/functions&lt;/strong&gt; - external components that agents can use to communicate with the outside world. They are defined as Python callables and can be used for any external interaction we want to allow the agent to perform. Type annotations define the input and output of the tools, and Pydantic models are supported for more complex type schemas. For the time being, AutoGen supports only the OpenAI-compatible tool call API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code executors&lt;/strong&gt; - built-in code executors include local command line, Docker, and Jupyter. An agent can write and launch code, so theoretically agents can do anything that can be done in Python. None of the other frameworks makes code generation and execution this prominent; treating code execution as a first-class citizen is one of AutoGen's most interesting concepts.&lt;/li&gt;
&lt;/ul&gt;
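&lt;p&gt;The schema-from-annotations mechanism can be illustrated without AutoGen itself. The sketch below, with a made-up &lt;code&gt;search_documents&lt;/code&gt; stub, shows how a framework can derive an OpenAI-style tool description from a function's type hints; it illustrates the general idea, not AutoGen's actual implementation:&lt;/p&gt;

```python
import inspect
from typing import get_type_hints

# Map a few Python types to the JSON Schema type names tool-calling APIs expect
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(func):
    """Build a minimal OpenAI-style tool schema from a function's annotations."""
    hints = get_type_hints(func)
    params = {
        name: {"type": TYPE_MAP.get(hint, "object")}
        for name, hint in hints.items()
        if name != "return"
    }
    return {
        "name": func.__name__,
        "description": (func.__doc__ or "").strip(),
        "parameters": {"type": "object", "properties": params},
    }

def search_documents(query: str, limit: int):
    """Search the knowledge base for documents matching the query."""
    ...

schema = tool_schema(search_documents)
```

&lt;p&gt;A real framework would additionally mark required parameters and register the callable for execution, but the annotations already carry everything needed to describe the tool to an LLM.&lt;/p&gt;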

&lt;p&gt;Each AutoGen agent uses at least one of these components: a human-in-the-loop, a code executor, a tool executor, or an LLM. A simple agentic RAG system, based on a conversation between two agents that can retrieve documents from a vector database or improve the query, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversableAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen.agentchat.contrib.retrieve_user_proxy_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrieveUserProxyAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="n"&gt;response_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_generator_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You answer user questions based solely on the provided context. You ask to retrieve relevant documents for &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your query, or reformulate the query, if it is incorrect in some way.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A response generator agent that can answer your queries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}]},&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user_proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RetrieveUserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}]},&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieve_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_token_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_or_create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initiate_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those new to agent development, AutoGen offers AutoGen Studio, a low-code interface for prototyping agents. While not intended for production use, it significantly lowers the barrier to entry for experimenting with agent architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl4f0daj2zg8tqpcy5rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl4f0daj2zg8tqpcy5rd.png" alt="AutoGen Studio" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that AutoGen is currently undergoing significant updates, with version 0.4.x in development introducing substantial API changes compared to the stable 0.2.x release. While the framework currently has limited built-in persistence and state management capabilities, these features may evolve in future releases.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Swarm
&lt;/h3&gt;

&lt;p&gt;Unlike the other frameworks described in this article, OpenAI Swarm is an educational project and is not ready for production use. It's still worth mentioning, though, as it's lightweight and easy to get started with. OpenAI Swarm is an experimental framework for orchestrating multi-agent workflows that focuses on agent coordination through direct handoffs rather than complex orchestration patterns.&lt;/p&gt;

&lt;p&gt;With that setup, &lt;strong&gt;agents&lt;/strong&gt; simply exchange messages in a chat, optionally calling Python functions to communicate with external services, or handing off the conversation to another agent if that agent seems better suited to answer the question. Each agent has a certain role, defined by the instructions we provide.&lt;/p&gt;

&lt;p&gt;We have to decide which LLM a particular agent will use and which functions it can call. For example, &lt;strong&gt;a retrieval agent could use a vector database to retrieve documents&lt;/strong&gt; and return the results to the next agent. That means there should be a function that performs the semantic search on its behalf, while the model decides what the query should look like.&lt;/p&gt;

&lt;p&gt;Here is how a similar agentic RAG application could look when implemented in OpenAI Swarm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;swarm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Swarm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Swarm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve documents based on the query.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer_to_query_improve_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query_improve_agent&lt;/span&gt;

&lt;span class="n"&gt;query_improve_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query Improve Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a search expert that takes user queries and improves them to get better results. You fix typos and &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extend queries with synonyms, if needed. You never ask the user for more information.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response_generation_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Generation Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You take the whole conversation and generate a final response based on the chat history. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have enough information, you can retrieve the documents from the knowledge base or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reformulate the query by transferring to other agent. You never ask the user for more information. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have to always be the last participant of each conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;functions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfer_to_query_improve_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_generation_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though we don't explicitly define a processing graph, the agents can still decide to hand off processing to a different agent. There is no concept of state, so everything relies on the messages exchanged between the components.&lt;/p&gt;

&lt;p&gt;OpenAI Swarm does not focus on integration with external tools, so &lt;strong&gt;if you would like to integrate semantic search with Qdrant, you would have to implement it fully yourself&lt;/strong&gt;. The library is also tightly coupled with OpenAI models; while using other models is possible, it requires additional work, such as setting up a proxy that adapts their interfaces to the OpenAI API.&lt;/p&gt;
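&lt;p&gt;As a rough sketch of what such a self-built retrieval function could look like, here is a toy in-memory version. In a real system, the vectors would be stored in Qdrant and the similarity search delegated to it, and the embeddings would come from an embedding model; all names and numbers here are illustrative:&lt;/p&gt;

```python
import math

# Toy in-memory "index": in a real system these vectors would live in Qdrant
INDEX = [
    ("Qdrant is a vector database written in Rust.", [0.9, 0.1, 0.0]),
    ("Bananas are rich in potassium.", [0.0, 0.2, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_documents(query_vector, top_k=1):
    """Return the top_k documents most similar to the query vector."""
    ranked = sorted(INDEX, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

&lt;p&gt;Swapping the toy index for a Qdrant collection is then mostly a matter of replacing the sorting step with a query against the client.&lt;/p&gt;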

&lt;h3&gt;
  
  
  The winner?
&lt;/h3&gt;

&lt;p&gt;Choosing the best framework for your agentic RAG system depends on your existing stack, team expertise, and the specific requirements of your project. All the described tools are strong contenders, and they are being developed at a rapid pace. It's worth keeping an eye on all of them, as they are likely to evolve and improve over time. Eventually, you should be able to build the same processes with any of them, but some may be more suitable within the specific ecosystem of tools you want your agent to interact with.&lt;/p&gt;

&lt;p&gt;There are, however, some important factors to consider when choosing a framework for your agentic RAG system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop&lt;/strong&gt; - even though we aim to build autonomous agents, it's often important to include human feedback so that our agents cannot perform harmful actions. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - how easy it is to debug the system and to understand what's happening inside. This is especially important since we are dealing with many LLM prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, choosing the right toolkit depends on the state of your project and your specific requirements. If you want to integrate your agent with a number of external tools, CrewAI might be the best choice, as its set of out-of-the-box integrations is the largest. However, LangGraph integrates well with LangChain, so if you are familiar with that ecosystem, it may suit you better. &lt;/p&gt;

&lt;p&gt;All the frameworks take different approaches to building agents, so it's worth experimenting with them to see which one fits your needs best. LangGraph and CrewAI are more mature and feature-rich, while AutoGen and OpenAI Swarm are more lightweight and experimental. However, &lt;strong&gt;none of the existing frameworks solves all the mentioned Information Retrieval problems&lt;/strong&gt;, so you still have to build your own tools to fill the gaps. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building Agentic RAG with Qdrant
&lt;/h2&gt;

&lt;p&gt;No matter which framework you choose, Qdrant is a great tool for building agentic RAG systems. Please check out &lt;a href="https://dev.to/documentation/frameworks/"&gt;our integrations&lt;/a&gt; to choose the best one for your use case and preferences. The easiest way to start using Qdrant is our managed service, &lt;a href="https://cloud.qdrant.io" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt;. A free 1GB cluster is available, so you can start building your agentic RAG system in minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;p&gt;See how Qdrant integrates with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/autogen/" rel="noopener noreferrer"&gt;Autogen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/crewai/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/langgraph/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/swarm/" rel="noopener noreferrer"&gt;Swarm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qdrant</category>
      <category>rag</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance!</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Thu, 29 Aug 2024 15:59:51 +0000</pubDate>
      <link>https://dev.to/qdrant/any-embedding-model-can-become-a-late-interaction-model-if-you-give-it-a-chance-3iip</link>
      <guid>https://dev.to/qdrant/any-embedding-model-can-become-a-late-interaction-model-if-you-give-it-a-chance-3iip</guid>
      <description>&lt;p&gt;* At least any open-source model, since you need access to its internals.&lt;/p&gt;

&lt;h3&gt;
  
  
  You Can Adapt Dense Embedding Models for Late Interaction
&lt;/h3&gt;

&lt;p&gt;Qdrant 1.10 introduced support for multi-vector representations, with late interaction being a prominent example of this approach. In essence, both documents and queries are represented by multiple vectors, and identifying the most relevant documents involves calculating a score based on the similarity between corresponding query and document embeddings. If you're not familiar with this paradigm, our updated &lt;a href="https://dev.to/articles/hybrid-search/"&gt;Hybrid Search&lt;/a&gt; article explains how multi-vector representations can enhance retrieval quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; We can visualize late interaction between corresponding document-query embedding pairs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3zfr1tmzptr7ra0sjaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3zfr1tmzptr7ra0sjaz.png" alt="Late interaction" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many specialized late interaction models, such as ColBERT, but &lt;strong&gt;it appears that regular dense embedding models can also be effectively utilized in this manner&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this study, we will demonstrate that standard dense embedding models, traditionally used for single-vector representations, can be effectively adapted for late interaction scenarios using output token embeddings as multi-vector representations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By testing out retrieval with Qdrant’s multi-vector feature, we will show that these models can rival or surpass specialized late interaction models in retrieval performance, while offering lower complexity and greater efficiency. This work redefines the potential of dense models in advanced search pipelines, presenting a new method for optimizing retrieval systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Embedding Models
&lt;/h2&gt;

&lt;p&gt;The inner workings of embedding models might be surprising to some. The model doesn’t operate directly on the input text; instead, it requires a tokenization step to convert the text into a sequence of token identifiers. Each token identifier is then passed through an embedding layer, which transforms it into a dense vector. Essentially, the embedding layer acts as a lookup table that maps token identifiers to dense vectors. These vectors are then fed into the transformer model as input.&lt;/p&gt;
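&lt;p&gt;The lookup-table behavior can be shown with a toy vocabulary. All numbers below are made up; real models learn vocabularies of tens of thousands of tokens and use much higher-dimensional vectors:&lt;/p&gt;

```python
# Toy illustration of the tokenization and embedding-lookup steps
VOCAB = {"qdrant": 0, "is": 1, "fast": 2}
EMBEDDING_TABLE = [
    [0.1, 0.3],  # embedding for token id 0
    [0.5, 0.2],  # embedding for token id 1
    [0.4, 0.9],  # embedding for token id 2
]

def tokenize(text):
    """Map whitespace-separated words to token identifiers."""
    return [VOCAB[word] for word in text.lower().split()]

def embed_tokens(token_ids):
    """The embedding layer is just a lookup: one dense vector per token id."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]

token_ids = tokenize("Qdrant is fast")
input_embeddings = embed_tokens(token_ids)
```

&lt;p&gt;Note that the same token id always maps to the same vector, which is exactly why these input embeddings are context-free until the transformer layers process them.&lt;/p&gt;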

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; The tokenization step, which takes place before vectors are added to the transformer model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81uxh0sz0fagvusxibqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81uxh0sz0fagvusxibqu.png" alt="Input token embeddings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The input token embeddings are context-free and are learned during the model’s training process. This means that each token always receives the same embedding, regardless of its position in the text. At this stage, the token embeddings are unaware of the context in which they appear. It is the transformer model’s role to contextualize these embeddings.&lt;/p&gt;

&lt;p&gt;Much has been discussed about the role of attention in transformer models, but in essence, this mechanism is responsible for capturing cross-token relationships. Each transformer module takes a sequence of token embeddings as input and produces a sequence of output token embeddings. Both sequences are of the same length, with each token embedding being enriched by information from the other token embeddings at the current step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; The mechanism which produces a sequence of output token embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg5yjc55utu5fn30n0zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg5yjc55utu5fn30n0zx.png" alt="Output token embeddings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; The final step performed by the embedding model is pooling the output token embeddings to generate a single vector representation of the input text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feop04kiezatxax2jsk3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feop04kiezatxax2jsk3f.png" alt="Pooling" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are several pooling strategies, but regardless of which one a model uses, the output is always a single vector representation, which inevitably loses some information about the input. It’s akin to giving someone detailed, step-by-step directions to the nearest grocery store versus simply pointing in the general direction. While the vague direction might suffice in some cases, the detailed instructions are more likely to lead to the desired outcome.&lt;/p&gt;
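&lt;p&gt;As a minimal sketch, here are two common pooling strategies applied to toy output token embeddings (real embeddings have hundreds of dimensions):&lt;/p&gt;

```python
def mean_pooling(token_embeddings):
    """Average all output token embeddings into a single vector."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

def cls_pooling(token_embeddings):
    """Use the first token's embedding (e.g. [CLS]) as the text representation."""
    return token_embeddings[0]

outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
sentence_vector = mean_pooling(outputs)  # [3.0, 2.0]
```

&lt;p&gt;Whichever strategy is used, three token vectors are collapsed into one, which is the information loss described above.&lt;/p&gt;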

&lt;h3&gt;
  
  
  Using Output Token Embeddings for Multi-Vector Representations
&lt;/h3&gt;

&lt;p&gt;We often overlook the output token embeddings, but the fact is—they also serve as multi-vector representations of the input text. So, why not explore their use in a multi-vector retrieval model, similar to late interaction models?&lt;/p&gt;
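&lt;p&gt;The scoring used by late interaction models such as ColBERT is often called MaxSim: each query token embedding is matched against its best-scoring document token embedding, and the maxima are summed. A plain-Python sketch with toy two-dimensional vectors:&lt;/p&gt;

```python
def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_embeddings, document_embeddings):
    """Late-interaction (MaxSim) score: for every query token embedding,
    take its best match among the document token embeddings, then sum."""
    return sum(
        max(dot(q, d) for d in document_embeddings)
        for q in query_embeddings
    )

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.8, 0.1], [0.2, 0.9]]
score = maxsim_score(query, doc)  # 0.8 + 0.9 = 1.7
```

&lt;p&gt;The same formula applies whether the multi-vector representations come from a dedicated late interaction model or, as in this study, from the output token embeddings of a regular dense model.&lt;/p&gt;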

&lt;h4&gt;
  
  
  Experimental Findings
&lt;/h4&gt;

&lt;p&gt;We conducted several experiments to determine whether output token embeddings could be effectively used in place of traditional late interaction models. The results are quite promising.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Experiment&lt;/th&gt;
            &lt;th&gt;NDCG@10&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;SciFact&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.70928&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.69579&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.64508&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.70724&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.68213&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.73696&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;NFCorpus&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.34166&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.35036&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.31594&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.35779&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.29696&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.37502&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;ArguAna&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.47271&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.44534&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.50167&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.45997&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.58857&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.57648&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_all_exact.py" rel="noopener noreferrer"&gt;source code for these experiments is open-source&lt;/a&gt; and utilizes &lt;a href="https://github.com/kacperlukawski/beir-qdrant" rel="noopener noreferrer"&gt;&lt;code&gt;beir-qdrant&lt;/code&gt;&lt;/a&gt;, an integration of Qdrant with the &lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BeIR library&lt;/a&gt;. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.&lt;/p&gt;

&lt;p&gt;Even the simple &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model can be applied in a late-interaction fashion, with a positive impact on retrieval quality. However, the best results were achieved with the &lt;code&gt;BAAI/bge-small-en&lt;/code&gt; model, which outperformed both the sparse and late interaction models.&lt;/p&gt;

&lt;p&gt;It's important to note that ColBERT has not been trained on BeIR datasets, making its performance fully out-of-domain. Nevertheless, the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data" rel="noopener noreferrer"&gt;training dataset&lt;/a&gt; also lacks any BeIR data, yet it still performs remarkably well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparative Analysis of Dense vs. Late Interaction Models
&lt;/h3&gt;

&lt;p&gt;The retrieval quality speaks for itself, but there are other important factors to consider.&lt;/p&gt;

&lt;p&gt;The traditional dense embedding models we tested are less complex than late interaction or sparse models. With fewer parameters, these models are expected to be faster during inference and more cost-effective to maintain. Below is a comparison of the models used in the experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Number of parameters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;109,514,298&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;109,580,544&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;33,360,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;22,713,216&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One argument against using output token embeddings is the increased storage requirements compared to ColBERT-like models. For instance, the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model produces 384-dimensional output token embeddings, three times the size of the 128-dimensional embeddings generated by ColBERT-like models. This increase not only leads to higher memory usage but also impacts the computational cost of retrieval, as calculating distances takes more time. Mitigating this issue through vector compression would make a lot of sense.&lt;/p&gt;
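&lt;p&gt;The storage difference is easy to quantify. A back-of-the-envelope comparison for a hypothetical 100-token document, assuming float32 values:&lt;/p&gt;

```python
FLOAT32_BYTES = 4
TOKENS = 100  # hypothetical document length

minilm_bytes = TOKENS * 384 * FLOAT32_BYTES   # all-MiniLM-L6-v2 token embeddings
colbert_bytes = TOKENS * 128 * FLOAT32_BYTES  # ColBERT-style 128-dim embeddings

print(minilm_bytes)                  # 153600
print(colbert_bytes)                 # 51200
print(minilm_bytes / colbert_bytes)  # 3.0
```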

&lt;h4&gt;
  
  
  Exploring Quantization for Multi-Vector Representations
&lt;/h4&gt;

&lt;p&gt;Binary quantization is generally more effective for high-dimensional vectors, making the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model, with its relatively low-dimensional outputs, less ideal for this approach. However, scalar quantization appeared to be a viable alternative. The table below summarizes the impact of quantization on retrieval quality.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Experiment&lt;/th&gt;
            &lt;th&gt;NDCG@10&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;SciFact&lt;/th&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.70724&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings (uint8)&lt;/td&gt;
            &lt;td&gt;0.70297&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;NFCorpus&lt;/th&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.35779&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings (uint8)&lt;/td&gt;
            &lt;td&gt;0.35572&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s important to note that quantization doesn’t always preserve retrieval quality, but in this case the effect of scalar quantization is negligible, while the memory savings are substantial.&lt;/p&gt;

&lt;p&gt;We managed to maintain the original quality while using four times less memory. Additionally, a quantized vector requires 384 bytes, compared to ColBERT’s 512 bytes. This results in a 25% reduction in memory usage, with retrieval quality remaining nearly unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Application: Enhancing Retrieval with Dense Models
&lt;/h3&gt;

&lt;p&gt;If you’re using one of the sentence transformer models, the output token embeddings are calculated by default. While a single vector representation is more efficient in terms of storage and computation, there’s no need to discard the output token embeddings. According to our experiments, these embeddings can significantly enhance retrieval quality. You can store both the single vector and the output token embeddings in Qdrant, using the single vector for the initial retrieval step and then reranking the results with the output token embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; A single model pipeline that relies solely on the output token embeddings for reranking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosc6fs0fez7z7uvt6m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosc6fs0fez7z7uvt6m9.png" alt="Single model reranking" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we implemented a simple reranking pipeline in Qdrant. This pipeline uses a dense embedding model for the initial oversampled retrieval and then relies solely on the output token embeddings for the reranking step.&lt;/p&gt;

&lt;h4&gt;
  
  
  Single Model Retrieval and Reranking Benchmarks
&lt;/h4&gt;

&lt;p&gt;Our tests focused on using the same model for both retrieval and reranking. The reported metric is NDCG@10. In all tests, we applied an oversampling factor of 5x, meaning the retrieval step returned 50 results, which were then narrowed down to 10 during the reranking step. Below are the results for some of the BeIR datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;Dataset&lt;/th&gt;
            &lt;th colspan="2"&gt;&lt;code&gt;all-miniLM-L6-v2&lt;/code&gt;&lt;/th&gt;
            &lt;th colspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;dense embeddings only&lt;/th&gt;
            &lt;th&gt;dense + reranking&lt;/th&gt;
            &lt;th&gt;dense embeddings only&lt;/th&gt;
            &lt;th&gt;dense + reranking&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th&gt;SciFact&lt;/th&gt;
            &lt;td&gt;0.64508&lt;/td&gt;
            &lt;td&gt;0.70293&lt;/td&gt;
            &lt;td&gt;0.68213&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.73053&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;NFCorpus&lt;/th&gt;
            &lt;td&gt;0.31594&lt;/td&gt;
            &lt;td&gt;0.34297&lt;/td&gt;
            &lt;td&gt;0.29696&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.35996&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;ArguAna&lt;/th&gt;
            &lt;td&gt;0.50167&lt;/td&gt;
            &lt;td&gt;0.45378&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.58857&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.57302&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;Touche-2020&lt;/th&gt;
            &lt;td&gt;0.16904&lt;/td&gt;
            &lt;td&gt;0.19693&lt;/td&gt;
            &lt;td&gt;0.13055&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.19821&lt;/u&gt;&lt;/td&gt;        
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;TREC-COVID&lt;/th&gt;
            &lt;td&gt;0.47246&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.6379&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.45788&lt;/td&gt;
            &lt;td&gt;0.53539&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;FiQA-2018&lt;/th&gt;
            &lt;td&gt;0.36867&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.41587&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.31091&lt;/td&gt;
            &lt;td&gt;0.39067&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The source code for the benchmark is publicly available, and &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_reranking.py" rel="noopener noreferrer"&gt;you can find it in the repository of the &lt;code&gt;beir-qdrant&lt;/code&gt; package&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Overall, adding a reranking step using the same model typically improves retrieval quality. However, the quality of various late interaction models is &lt;a href="https://huggingface.co/mixedbread-ai/mxbai-colbert-large-v1#1-reranking-performance" rel="noopener noreferrer"&gt;often reported based on their reranking performance when BM25 is used for the initial retrieval&lt;/a&gt;. This experiment aimed to demonstrate how a single model can be effectively used for both retrieval and reranking, and the results are quite promising.&lt;/p&gt;

&lt;p&gt;Now, let's explore how to implement this using the new Query API introduced in Qdrant 1.10.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guide: Setting Up Qdrant for Late Interaction
&lt;/h3&gt;

&lt;p&gt;The new Query API in Qdrant 1.10 enables the construction of even more complex retrieval pipelines. We can use the single vector created after pooling for the initial retrieval step and then rerank the results using the output token embeddings.&lt;/p&gt;

&lt;p&gt;Assuming the collection is named &lt;code&gt;my-collection&lt;/code&gt; and is configured to store two named vectors: &lt;code&gt;dense-vector&lt;/code&gt; and &lt;code&gt;output-token-embeddings&lt;/code&gt;, here’s how such a collection could be created in Qdrant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-token-embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;multivector_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiVectorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;comparator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MultiVectorComparator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_SIM&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both vectors are of the same size since they are produced by the same &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, instead of using the search API with just a single dense vector, we can create a reranking pipeline. First, we retrieve 50 results using the dense vector, and then we rerank them using the output token embeddings to obtain the top 10 results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What else can be done with just all-MiniLM-L6-v2 model?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="c1"&gt;# Prefetch the dense embeddings of the top-50 documents
&lt;/span&gt;        &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Rerank the top-50 documents retrieved by the dense embedding model
&lt;/span&gt;    &lt;span class="c1"&gt;# and return just the top-10. Please note we call the same model, but
&lt;/span&gt;    &lt;span class="c1"&gt;# we ask for the token embeddings by setting the output_value parameter.
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-token-embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Try the Experiment Yourself
&lt;/h3&gt;

&lt;p&gt;In a real-world scenario, you might take it a step further by first calculating the token embeddings and then performing pooling to obtain the single vector representation. This approach allows you to complete everything in a single pass.&lt;/p&gt;

&lt;p&gt;The simplest way to start experimenting with building complex reranking pipelines in Qdrant is by using the forever-free cluster on &lt;a href="https://cloud.qdrant.io/" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt; and reading &lt;a href="https://qdrant.tech/documentation/"&gt;Qdrant's documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_all_exact.py" rel="noopener noreferrer"&gt;source code for these experiments is open-source&lt;/a&gt; and uses &lt;a href="https://github.com/kacperlukawski/beir-qdrant" rel="noopener noreferrer"&gt;&lt;code&gt;beir-qdrant&lt;/code&gt;&lt;/a&gt;, an integration of Qdrant with the &lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BeIR library&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Directions and Research Opportunities
&lt;/h3&gt;

&lt;p&gt;The initial experiments using output token embeddings in the retrieval process have yielded promising results. However, we plan to conduct further benchmarks to validate these findings and explore the incorporation of sparse methods for the initial retrieval. Additionally, we aim to investigate the impact of quantization on multi-vector representations and its effects on retrieval quality. Finally, we will assess retrieval speed, a crucial factor for many applications.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What is Hybrid Search?</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Tue, 06 Feb 2024 15:33:52 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-hybrid-search-383g</link>
      <guid>https://dev.to/qdrant/what-is-hybrid-search-383g</guid>
      <description>&lt;p&gt;There is not a single definition of hybrid search. Actually, if we use more than one search algorithm, it might be described as some sort of hybrid. Some of the most popular definitions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A combination of vector search with &lt;a href="https://qdrant.tech/documentation/filtering/"&gt;attribute filtering&lt;/a&gt;. We won't dive much into details, as we like to call it just filtered vector search.&lt;/li&gt;
&lt;li&gt;Vector search with keyword-based search. This one is covered in this article.&lt;/li&gt;
&lt;li&gt;A mix of dense and sparse vectors. That strategy will be covered in the upcoming article.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why do we still need keyword search?
&lt;/h2&gt;

&lt;p&gt;Keyword-based search was the obvious choice for search engines in the past. It struggled with some common issues, but since there were no alternatives, we had to overcome them with additional preprocessing of the documents and queries.&lt;/p&gt;

&lt;p&gt;Vector search turned out to be a breakthrough, as it has some clear advantages in the following scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌍 Multi-lingual &amp;amp; multi-modal search&lt;/li&gt;
&lt;li&gt;🤔 Short texts with typos and ambiguous, context-dependent meanings&lt;/li&gt;
&lt;li&gt;👨‍🔬 Specialized domains with tuned encoder models&lt;/li&gt;
&lt;li&gt;📄 Document-as-a-Query similarity search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't mean we don't need keyword search anymore. There are still cases in which this kind of method might be useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌐💭 Out-of-domain search. Words are just words, no matter what they mean. BM25 ranking captures a universal property of natural language: less frequent words are more important, as they carry most of the meaning.&lt;/li&gt;
&lt;li&gt;⌨️💨 Search-as-you-type, when only a few characters have been typed in and we cannot use vector search yet.&lt;/li&gt;
&lt;li&gt;🎯🔍 Exact phrase matching, when we want to find occurrences of a specific term in the documents. That's especially useful for names of products, people, part numbers, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Matching the tool to the task
&lt;/h2&gt;

&lt;p&gt;There are various cases in which we need search capabilities, and each of them has different requirements. Therefore, there is no single strategy to rule them all, and different tools may fit us better. Text search itself might be roughly divided into multiple specializations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Web-scale search - documents retrieval&lt;/li&gt;
&lt;li&gt;Fast search-as-you-type&lt;/li&gt;
&lt;li&gt;Search over less-than-natural texts (logs, transactions, code, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those scenarios has tools that perform better for that specific use case. If you already expose search capabilities, then you probably have one of them in your tech stack, and we can easily combine those tools with vector search to get the best of both worlds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fast search: A Fallback strategy
&lt;/h2&gt;

&lt;p&gt;The easiest way to incorporate vector search into your existing stack is to treat it as a fallback strategy: whenever the keyword search struggles to find proper results, you can run a semantic search to extend them. That is especially important in cases like search-as-you-type, where a new query is fired every time the user types another character.&lt;/p&gt;

&lt;p&gt;For such cases, the speed of the search is crucial, so we can't run vector search on every query. At the same time, a simple prefix search might have poor recall.&lt;/p&gt;

&lt;p&gt;In this case, a good strategy is to use vector search only when the keyword/prefix search returns no results or just a few. A good candidate for this is &lt;a href="https://www.meilisearch.com/"&gt;MeiliSearch&lt;/a&gt;, which uses custom ranking rules to provide results as fast as the user can type.&lt;/p&gt;

&lt;p&gt;The pseudocode of such a strategy may go as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get fast results from MeiliSearch
&lt;/span&gt;    &lt;span class="n"&gt;keyword_search_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_meili&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if there are enough results
&lt;/span&gt;    &lt;span class="c1"&gt;# or if the results are good enough for given query
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;are_results_enough&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword_search_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;keyword_search&lt;/span&gt;

    &lt;span class="c1"&gt;# Encoding takes time, but we get more results
&lt;/span&gt;    &lt;span class="n"&gt;vector_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vector_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_qdrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vector_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
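&lt;p&gt;To make that concrete, here is a minimal, self-contained sketch of the fallback logic. The &lt;code&gt;search_keyword_stub&lt;/code&gt;, &lt;code&gt;search_vector_stub&lt;/code&gt; functions and the &lt;code&gt;MIN_RESULTS&lt;/code&gt; threshold are hypothetical stand-ins, not MeiliSearch or Qdrant APIs:&lt;/p&gt;

```python
import asyncio

# Assumed threshold: fall back to vector search below this many keyword hits
MIN_RESULTS = 3

async def search_keyword_stub(query):
    # Stand-in for a fast MeiliSearch prefix/keyword query
    index = {"desk": ["gaming desk", "standing desk", "office desk"]}
    return index.get(query, [])

async def search_vector_stub(query):
    # Stand-in for encoding the query and searching Qdrant
    return [f"semantic match for {query!r}"]

async def search(query):
    results = await search_keyword_stub(query)
    # Enough keyword results: return them without paying the encoding cost
    if len(results) >= MIN_RESULTS:
        return results
    # Otherwise fall back to the slower but higher-recall vector search
    return await search_vector_stub(query)

print(asyncio.run(search("desk")))
print(asyncio.run(search("cybersport desk")))
```

&lt;p&gt;A real &lt;code&gt;are_results_enough&lt;/code&gt; check may also inspect result quality, not just the count.&lt;/p&gt;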



&lt;h2&gt;
  
  
  The precise search: The re-ranking strategy
&lt;/h2&gt;

&lt;p&gt;In the case of document retrieval, we care more about search result quality, and time is not as tight a constraint. There are several search engines specializing in full-text search that we found interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/quickwit-oss/tantivy"&gt;Tantivy&lt;/a&gt; - a full-text indexing library written in Rust, with great performance and a rich feature set.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/lnx-search/lnx"&gt;lnx&lt;/a&gt; - a young but promising project, utilizes Tanitvy as a backend.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/zinclabs/zinc"&gt;ZincSearch&lt;/a&gt; - a project written in Go, focused on minimal resource usage 
and high performance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/valeriansaliou/sonic"&gt;Sonic&lt;/a&gt; - a project written in Rust, uses custom network communication 
protocol for fast communication between the client and the server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any of those engines can easily be used in combination with the vector search offered by Qdrant. However, the exact way to combine the results of both algorithms for the best search precision may still be unclear, so we need to find out how to do it effectively. We will use reference datasets to benchmark the search quality.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why not linear combination?
&lt;/h2&gt;

&lt;p&gt;It is often proposed to rerank results with a linear combination of the full-text and vector search scores, like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;final_score = 0.7 * vector_score + 0.3 * full_text_score&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;However, we didn't even consider such a setup. Why? Because those scores don't make the problem linearly separable. We used the BM25 score along with the cosine vector similarity as point coordinates in 2-dimensional space. The chart shows how those points are distributed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy42vdb7ctcwylc5kprvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy42vdb7ctcwylc5kprvj.png" alt="A distribution of both Qdrant and BM25 scores mapped into 2D space." width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A distribution of both Qdrant and BM25 scores mapped into 2D space. It clearly shows relevant and non-relevant objects are not linearly separable in that space, so using a linear combination of both scores won't give us a proper hybrid search.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Both relevant and non-relevant items are mixed. &lt;strong&gt;No linear formula would be able to distinguish between them.&lt;/strong&gt; Thus, this is not the way to solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to approach re-ranking?
&lt;/h2&gt;

&lt;p&gt;A common approach is to re-rank the search results with a model that takes some additional factors into account. Such models are usually trained on clickstream data from a real application and tend to be very business-specific, so we won't cover them here. Instead, there is a more general approach: so-called &lt;strong&gt;cross-encoder models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A cross-encoder takes a pair of texts and predicts their similarity. Unlike embedding models, cross-encoders do not compress a text into a vector but use interactions between the individual tokens of both texts. In general, they are more powerful than both BM25 and vector search, but they are also much slower. That makes it feasible to use cross-encoders only for re-ranking a preselected set of candidates.&lt;/p&gt;

&lt;p&gt;This is what pseudocode for that strategy looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;keyword_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_keyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_qdrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;all_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# parallel calls
&lt;/span&gt;    &lt;span class="n"&gt;rescored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_encoder_rescore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rescored&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is worth mentioning that the keyword search and vector search queries can run in parallel, and the cross-encoder can start scoring results as soon as the fastest search engine returns its results.&lt;/p&gt;
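&lt;p&gt;The &lt;code&gt;cross_encoder_rescore&lt;/code&gt; step can be sketched in plain Python. The toy &lt;code&gt;overlap_score&lt;/code&gt; function below is only a stand-in for a real cross-encoder model; the point is the merge, deduplicate, score and sort flow:&lt;/p&gt;

```python
def cross_encoder_rescore(query, results, score_fn):
    # Deduplicate candidates coming from both backends, keeping their order
    unique = list(dict.fromkeys(results))
    # Score every (query, document) pair and sort by descending score
    scored = [(score_fn(query, doc), doc) for doc in unique]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

def overlap_score(query, doc):
    # Toy scorer: fraction of query tokens present in the document
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens.intersection(doc_tokens)) / max(len(query_tokens), 1)

candidates = ["office chair", "computer task chair", "computer chair", "office chair"]
print(cross_encoder_rescore("computer chair", candidates, overlap_score))
```

&lt;p&gt;In practice, &lt;code&gt;score_fn&lt;/code&gt; would call the cross-encoder model on each pair instead of counting token overlap.&lt;/p&gt;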

&lt;h2&gt;
  
  
  Experiments
&lt;/h2&gt;

&lt;p&gt;For this benchmark, three experiments were conducted:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Vector search with Qdrant&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All the documents and queries are vectorized with the &lt;a href="https://www.sbert.net/docs/pretrained_models.html"&gt;all-MiniLM-L6-v2&lt;/a&gt; model and compared with cosine similarity.&lt;/p&gt;
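&lt;p&gt;Cosine similarity itself is simple to express. Below is a minimal pure-Python version with hard-coded toy vectors; a real pipeline would obtain the embeddings from the all-MiniLM-L6-v2 model instead:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors score 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors score 0.0
```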

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Keyword-based search with BM25&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All the documents are indexed by BM25 and queried with its default configuration.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Vector and keyword-based candidates generation and cross-encoder reranking&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Qdrant and BM25 provide N candidates each, and the &lt;a href="https://www.sbert.net/docs/pretrained-models/ce-msmarco.html"&gt;ms-marco-MiniLM-L-6-v2&lt;/a&gt; cross-encoder performs reranking on those candidates only. This approach makes it possible to combine the power of semantic and keyword-based search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cltcphasz6j6jst5l5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cltcphasz6j6jst5l5j.png" alt="The design of all the three experiments" width="800" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality metrics
&lt;/h3&gt;

&lt;p&gt;There are various ways to measure the performance of search engines, and &lt;em&gt;&lt;a href="https://neptune.ai/blog/recommender-systems-metrics"&gt;Recommender Systems: Machine Learning Metrics and Business Metrics&lt;/a&gt;&lt;/em&gt; is a great introduction to the topic. I selected the following ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NDCG@5, NDCG@10&lt;/li&gt;
&lt;li&gt;DCG@5, DCG@10&lt;/li&gt;
&lt;li&gt;MRR@5, MRR@10&lt;/li&gt;
&lt;li&gt;Precision@5, Precision@10&lt;/li&gt;
&lt;li&gt;Recall@5, Recall@10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since both systems return a score for each result, we could use the DCG and NDCG metrics. However, BM25 scores are not normalized by default, so we normalized them to the range &lt;code&gt;[0, 1]&lt;/code&gt; by dividing each score by the maximum score returned for that query.&lt;/p&gt;
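&lt;p&gt;Both steps are straightforward to sketch in plain Python. The following is an illustrative implementation of the per-query score normalization and of NDCG@k, not the benchmark's actual code:&lt;/p&gt;

```python
import math

def normalize_scores(scores):
    # Map raw BM25 scores into [0, 1] by dividing by the per-query maximum
    top = max(scores) if scores else 0.0
    if top == 0.0:
        return [0.0 for _ in scores]
    return [score / top for score in scores]

def dcg_at_k(relevances, k):
    # Discounted Cumulative Gain over the top-k results
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg else 0.0

print(normalize_scores([8.0, 4.0, 2.0]))  # [1.0, 0.5, 0.25]
print(ndcg_at_k([1.0, 0.0, 0.5, 1.0, 0.0], 5))
```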

&lt;h3&gt;
  
  
  Datasets
&lt;/h3&gt;

&lt;p&gt;There are various benchmarks for search relevance available, and full-text search has been a strong baseline for most of them. However, there are also cases in which semantic search works better by default. For this article, I'm performing &lt;strong&gt;zero-shot search&lt;/strong&gt;, meaning the models didn't have any prior exposure to the benchmark datasets, so this is effectively an out-of-domain search.&lt;/p&gt;

&lt;h4&gt;
  
  
  Home Depot
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/competitions/home-depot-product-search-relevance/"&gt;Home Depot dataset&lt;/a&gt; consists of real inventory and search queries from Home Depot's website with a relevancy score from 1 (not relevant) to 3 (highly relevant).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anna Montoya, RG, Will Cukierski. (2016). Home Depot Product Search Relevance. Kaggle. 
https://kaggle.com/competitions/home-depot-product-search-relevance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are over 124k products with textual descriptions in the dataset and around 74k search queries with a relevancy score assigned. For the purposes of our benchmark, the relevancy scores were also normalized.&lt;/p&gt;

&lt;h4&gt;
  
  
  WANDS
&lt;/h4&gt;

&lt;p&gt;I also selected a relatively new search relevance dataset. &lt;a href="https://github.com/wayfair/WANDS"&gt;WANDS&lt;/a&gt;, which stands for Wayfair ANnotation Dataset, is designed to evaluate search engines for e-commerce.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WANDS: Dataset for Product Search Relevance Assessment
Yan Chen, Shujian Liu, Zheng Liu, Weiyi Sun, Linas Baltrunas and Benjamin Schroeder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a nutshell, the dataset consists of products, queries and human-annotated relevancy labels. Each product has various textual attributes, as well as facets. The relevancy is provided as textual labels: “Exact”, “Partial” and “Irrelevant”, and the authors suggest converting those to 1.0, 0.5 and 0.0, respectively. There are 488 queries, each with a varying number of relevant items.&lt;/p&gt;
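&lt;p&gt;The suggested conversion is a simple lookup; the function name below is ours, not part of WANDS:&lt;/p&gt;

```python
# Mapping suggested by the WANDS authors
LABEL_SCORES = {"Exact": 1.0, "Partial": 0.5, "Irrelevant": 0.0}

def label_to_relevance(label):
    # Convert a textual judgement into a numeric relevancy score
    return LABEL_SCORES[label]

print([label_to_relevance(label) for label in ["Exact", "Partial", "Irrelevant"]])
```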

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;Both datasets have been evaluated with the same experiments. The achieved performance is shown in the tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Home Depot
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flchcvkxnax22suzczx26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flchcvkxnax22suzczx26.png" alt="The results of all the experiments conducted on Home Depot dataset" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results achieved with BM25 alone are better than with Qdrant alone. However, if we combine both methods into hybrid search with a cross-encoder as the last step, we get a great improvement over either baseline method.&lt;/p&gt;

&lt;p&gt;With the cross-encoder approach, Qdrant retrieved about 56.05% of the relevant items on average, while BM25 fetched 59.16%. Those numbers don't sum up to 100%, because some items were returned by both systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  WANDS
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfrdztzd4mq5nrs9l2dx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfrdztzd4mq5nrs9l2dx.png" alt="The results of all the experiments conducted on WANDS dataset" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This dataset seems better suited for semantic search, but the results might also be improved by a hybrid search approach with a cross-encoder model as the final step.&lt;/p&gt;

&lt;p&gt;Overall, combining full-text and semantic search with an additional reranking step seems to be a good idea, as we benefit from the advantages of both methods.&lt;/p&gt;

&lt;p&gt;Again, it is worth mentioning that in the third experiment, with cross-encoder reranking, Qdrant returned more than 48.12% of the relevant items and BM25 around 66.66%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some anecdotal observations
&lt;/h2&gt;

&lt;p&gt;Neither algorithm works better in all cases. There are specific queries for which keyword-based search wins, and others where vector search does. The table shows some interesting examples we found in the WANDS dataset during the experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
   &lt;thead&gt;
      &lt;th&gt;Query&lt;/th&gt;
      &lt;th&gt;BM25 Search&lt;/th&gt;
      &lt;th&gt;Vector Search&lt;/th&gt;
   &lt;/thead&gt;
   &lt;tbody&gt;
      &lt;tr&gt;
         &lt;th&gt;cybersport desk&lt;/th&gt;
         &lt;td&gt;desk ❌&lt;/td&gt;
         &lt;td&gt;gaming desk ✅&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;th&gt;plates for icecream&lt;/th&gt;
         &lt;td&gt;"eat" plates on wood wall décor ❌&lt;/td&gt;
         &lt;td&gt;alicyn 8.5 '' melamine dessert plate ✅&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;th&gt;kitchen table with a thick board&lt;/th&gt;
         &lt;td&gt;craft kitchen acacia wood cutting board ❌&lt;/td&gt;
         &lt;td&gt;industrial solid wood dining table ✅&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;th&gt;wooden bedside table&lt;/th&gt;
         &lt;td&gt;30 '' bedside table lamp ❌&lt;/td&gt;
         &lt;td&gt;portable bedside end table ✅&lt;/td&gt;
      &lt;/tr&gt;

   &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And some examples where keyword-based search did better:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
   &lt;thead&gt;
      &lt;th&gt;Query&lt;/th&gt;
      &lt;th&gt;BM25 Search&lt;/th&gt;
      &lt;th&gt;Vector Search&lt;/th&gt;
   &lt;/thead&gt;
   &lt;tbody&gt;
      &lt;tr&gt;
         &lt;th&gt;computer chair&lt;/th&gt;
         &lt;td&gt;vibrant computer task chair ✅&lt;/td&gt;
         &lt;td&gt;office chair ❌&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;th&gt;64.2 inch console table&lt;/th&gt;
         &lt;td&gt;cervantez 64.2 '' console table ✅&lt;/td&gt;
         &lt;td&gt;69.5 '' console table ❌&lt;/td&gt;
      &lt;/tr&gt;
   &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  A wrap up
&lt;/h2&gt;

&lt;p&gt;Each search scenario requires a specialized tool to achieve the best possible results. Still, it is possible to combine multiple tools with minimal overhead to improve the search precision even further. Introducing vector search into an existing search stack doesn't need to be a revolution; it can be just one small step at a time.&lt;/p&gt;

&lt;p&gt;You'll never cover all the possible queries with a list of synonyms, so a full-text search may not find all the relevant documents. There are also cases in which your users use different terminology than the one in your database.&lt;/p&gt;

&lt;p&gt;Those problems are easily solvable with neural vector embeddings, and both approaches can be combined with an additional reranking step. So you don't need to abandon your well-known full-text search mechanism; instead, extend it with vector search to support the queries you haven't foreseen.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
