<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Qdrant</title>
    <description>The latest articles on DEV Community by Qdrant (@qdrant).</description>
    <link>https://dev.to/qdrant</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5330%2F84d1e97d-d01b-4ede-9c69-06112433b11b.png</url>
      <title>DEV Community: Qdrant</title>
      <link>https://dev.to/qdrant</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/qdrant"/>
    <language>en</language>
    <item>
      <title>Relevance Feedback in Information Retrieval</title>
      <dc:creator>Evgeniya Sukhodolskaya</dc:creator>
      <pubDate>Mon, 31 Mar 2025 21:08:23 +0000</pubDate>
      <link>https://dev.to/qdrant/relevance-feedback-in-informational-retrieval-54np</link>
      <guid>https://dev.to/qdrant/relevance-feedback-in-informational-retrieval-54np</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A problem well stated is a problem half solved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This quote applies as much to life as it does to information retrieval.&lt;/p&gt;

&lt;p&gt;With a well-formulated query, retrieving the relevant document becomes trivial. In reality, however, most users struggle to precisely define what they are searching for.&lt;/p&gt;

&lt;p&gt;While users may struggle to formulate a perfect request — especially in unfamiliar topics — they can easily judge whether a retrieved answer is relevant or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relevance is a powerful feedback mechanism for a retrieval system&lt;/strong&gt; to iteratively refine results in the direction of user interest.&lt;/p&gt;

&lt;p&gt;In 2025, with social media flooded with daily AI breakthroughs, it almost seems like information retrieval is solved: agents can iteratively adjust their search queries while assessing the relevance of the results.&lt;/p&gt;

&lt;p&gt;Of course, there’s a catch: these models still rely on retrieval systems (RAG isn’t dead yet, despite daily predictions of its demise). They receive only a handful of top-ranked results provided by a far simpler and cheaper retriever. As a result, the success of guided retrieval still mainly depends on the retrieval system itself.&lt;/p&gt;

&lt;p&gt;So, we should find a way of effectively and efficiently incorporating relevance feedback directly into a retrieval system. In this article, we’ll explore the approaches proposed in the research literature and try to answer the following question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If relevance feedback in search is so widely studied and praised as effective, why is it practically not used in dedicated vector search solutions?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dismantling the Relevance Feedback
&lt;/h2&gt;

&lt;p&gt;Both industry and academia tend to reinvent the wheel here and there. So, we first took some time to study and categorize different methods — just in case there was something we could plug directly into Qdrant. The resulting taxonomy isn’t set in stone, but we aim to make it useful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp53vdsz29r21gh1n5ine.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp53vdsz29r21gh1n5ine.png" alt="Types of Relevance Feedback" width="800" height="476"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Types of Relevance Feedback&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-Relevance Feedback (PRF)
&lt;/h2&gt;

&lt;p&gt;Pseudo-relevance feedback takes the top-ranked documents from the initial retrieval results and treats them as relevant. This approach might seem naive, but it provides a noticeable performance boost in lexical retrieval while being relatively cheap to compute.&lt;/p&gt;
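
&lt;p&gt;In code, the whole trick is one step: run the initial search and promote the top-k hits to a relevant set. A minimal Rust sketch (the &lt;code&gt;ScoredDoc&lt;/code&gt; type is our illustrative assumption, not a specific library API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;struct ScoredDoc {
    id: u64,
    score: f32,
}

/// Treat the k best-scored hits of the initial retrieval as relevant.
/// `hits` is assumed to be sorted by descending similarity score.
fn pseudo_relevant(hits: &amp;[ScoredDoc], k: usize) -&gt; &amp;[ScoredDoc] {
    &amp;hits[..k.min(hits.len())]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;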

&lt;h2&gt;
  
  
  Binary Relevance Feedback
&lt;/h2&gt;

&lt;p&gt;The most straightforward way to gather feedback is to ask users directly whether a document is relevant. There are two main limitations to this approach:&lt;/p&gt;

&lt;p&gt;First, users are notoriously reluctant to provide feedback. Did you know that &lt;a href="https://en.wikipedia.org/wiki/Google_SearchWiki#:~:text=SearchWiki%20was%20a%20Google%20Search,for%20a%20given%20search%20query" rel="noopener noreferrer"&gt;Google once had&lt;/a&gt; an upvote/downvote mechanism on search results but removed it because almost no one used it?&lt;/p&gt;

&lt;p&gt;Second, even if users are willing to provide feedback, no relevant documents might be present in the initial retrieval results. In this case, the user can’t provide a meaningful signal.&lt;/p&gt;

&lt;p&gt;Instead of asking users, we can ask a smart model to provide binary relevance judgements, but forcing its output to be binary would limit its potential to generate more granular judgements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Re-scored Relevance Feedback
&lt;/h2&gt;

&lt;p&gt;We can also apply more sophisticated methods to extract relevance feedback from the top-ranked documents: machine learning models can provide a relevance score for each document.&lt;/p&gt;

&lt;p&gt;The obvious concern here is twofold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How accurately can the automated judge determine relevance (or irrelevance)?&lt;/li&gt;
&lt;li&gt;How cost-efficient is it? After all, you can’t expect GPT-4o to re-rank thousands of documents for every user query — unless you’re filthy rich.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nevertheless, automated re-scored feedback could be a scalable way to improve search when explicit binary feedback is not accessible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Has the Problem Already Been Solved?
&lt;/h2&gt;

&lt;p&gt;Digging through research materials, the last thing we expected was to discover that the first relevance feedback study dates back &lt;em&gt;&lt;a href="https://sigir.org/files/museum/pub-08/XXIII-1.pdf" rel="noopener noreferrer"&gt;sixty years&lt;/a&gt;&lt;/em&gt;. In the midst of the neural search bubble, it’s easy to forget that lexical (term-based) retrieval has been around for decades. Naturally, research in that field has had enough time to develop.&lt;/p&gt;

&lt;p&gt;Neural search — aka &lt;a href="https://qdrant.tech/articles/neural-search-tutorial/" rel="noopener noreferrer"&gt;vector search&lt;/a&gt; — gained traction in the industry around 5 years ago. Hence, vector-specific relevance feedback techniques might still be in their early stages, awaiting production-grade validation and industry adoption.&lt;/p&gt;

&lt;p&gt;As a &lt;a href="https://qdrant.tech/articles/dedicated-vector-search/" rel="noopener noreferrer"&gt;dedicated vector search engine&lt;/a&gt;, we would like to be among those adopters. Our focus is neural search, but approaches in both lexical and neural retrieval seem worth exploring, as cross-field studies are always insightful, with the potential to reuse well-established methods of one field in another.&lt;/p&gt;

&lt;p&gt;We found some interesting methods applicable to neural search solutions and additionally revealed a gap in the neural search-based relevance feedback approaches. Stick around, and we’ll share our findings!&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Ways to Approach the Problem
&lt;/h2&gt;

&lt;p&gt;Retrieval as a recipe can be broken down into three main ingredients:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query&lt;/li&gt;
&lt;li&gt;Documents&lt;/li&gt;
&lt;li&gt;Similarity scoring between them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5zlfoih3dfy703t6x7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5zlfoih3dfy703t6x7b.png" alt="Research Field Taxonomy Overview" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Research Field Taxonomy Overview&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Query formulation is a subjective process – it can be done in infinitely many ways, making the relevance of a document unpredictable until the query is formulated and submitted to the system.&lt;/p&gt;

&lt;p&gt;So, adapting documents (or the search index) to relevance feedback would require per-request dynamic changes, which is impractical, considering that modern retrieval systems store billions of documents.&lt;/p&gt;

&lt;p&gt;Thus, approaches for incorporating relevance feedback in search fall into two categories: refining a query and refining the similarity scoring function between the query and documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Refinement
&lt;/h2&gt;

&lt;p&gt;There are several ways to refine a query based on relevance feedback. Globally, we prefer to distinguish between two approaches: modifying the query as text and modifying the vector representation of the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpg1hmg8871vo5z563vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpg1hmg8871vo5z563vy.png" alt="Incorporating Relevance Feedback in Query" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Incorporating Relevance Feedback in Query&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query As Text
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;term-based retrieval&lt;/strong&gt;, an intuitive way to improve a query would be to expand it with relevant terms. It resembles the “aha, so that’s what it’s called” stage of discovery search.&lt;/p&gt;

&lt;p&gt;Before the deep learning era of this century, expansion terms were mainly selected using statistical or probabilistic models. The idea was to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Either extract the &lt;strong&gt;most frequent&lt;/strong&gt; terms from (pseudo-)relevant documents;&lt;/li&gt;
&lt;li&gt;Or the &lt;strong&gt;most specific&lt;/strong&gt; ones (for example, according to IDF);&lt;/li&gt;
&lt;li&gt;Or the &lt;strong&gt;most probable&lt;/strong&gt; ones (most likely to appear in the query according to a relevance set).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Well-known methods of those times come from the family of &lt;a href="https://sigir.org/wp-content/uploads/2017/06/p260.pdf" rel="noopener noreferrer"&gt;Relevance Models&lt;/a&gt;, where terms for expansion are chosen based on their probability in pseudo-relevant documents (how often the terms appear) and on the likelihood of the query given those pseudo-relevant documents (how strongly the pseudo-relevant documents match the query).&lt;/p&gt;

&lt;p&gt;The most famous one, &lt;code&gt;RM3&lt;/code&gt; – an interpolation of expansion term probabilities with their probabilities in the query – still appears in papers from the last few years as a (noticeably decent) baseline in term-based retrieval, usually as part of &lt;a href="https://github.com/castorini/anserini" rel="noopener noreferrer"&gt;Anserini&lt;/a&gt;.&lt;/p&gt;
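
&lt;p&gt;As a rough Rust sketch of the RM3 idea (our compression of it, not the Anserini implementation; per-document term distributions and query likelihoods are assumed to be given):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;

/// RM3-style interpolation: lambda * P(t|query) + (1 - lambda) * P(t|R).
/// P(t|R) is each term's probability mass in the pseudo-relevant documents,
/// with every document weighted by how strongly it matched the query.
/// Assumes a non-empty feedback set.
fn rm3_weights(
    query_model: &amp;HashMap&lt;String, f32&gt;,        // P(t|query)
    feedback: &amp;[(HashMap&lt;String, f32&gt;, f32)],  // (P(t|doc), query likelihood)
    lambda: f32,
) -&gt; HashMap&lt;String, f32&gt; {
    let norm: f32 = feedback.iter().map(|(_, w)| w).sum();
    let mut relevance_model: HashMap&lt;String, f32&gt; = HashMap::new();
    for (term_probs, doc_weight) in feedback {
        for (term, p) in term_probs {
            *relevance_model.entry(term.clone()).or_insert(0.0) += p * doc_weight / norm;
        }
    }
    let mut expanded: HashMap&lt;String, f32&gt; = HashMap::new();
    for (term, p_rel) in &amp;relevance_model {
        let p_q = query_model.get(term).copied().unwrap_or(0.0);
        expanded.insert(term.clone(), lambda * p_q + (1.0 - lambda) * p_rel);
    }
    // Query terms absent from the relevance model keep their interpolated mass.
    for (term, p_q) in query_model {
        expanded.entry(term.clone()).or_insert(lambda * p_q);
    }
    expanded
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;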

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb6370tvbusklbmfm1bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb6370tvbusklbmfm1bg.png" alt="Simplified Query Expansion" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Simplified Query Expansion&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As retrieval approached the modern machine learning era, &lt;a href="https://aclanthology.org/2020.findings-emnlp.424.pdf" rel="noopener noreferrer"&gt;multiple studies&lt;/a&gt; began claiming that these traditional ways of query expansion are not as effective as they could be.&lt;/p&gt;

&lt;p&gt;What started with simple classifiers based on hand-crafted features naturally led to the use of the famous &lt;a href="https://huggingface.co/docs/transformers/model_doc/bert" rel="noopener noreferrer"&gt;BERT (Bidirectional Encoder Representations from Transformers)&lt;/a&gt;. For example, the BERT-QE (Query Expansion) authors came up with the following schema, sketched in code after the list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get pseudo-relevance feedback from the finetuned BERT reranker (~10 documents);&lt;/li&gt;
&lt;li&gt;Chunk these pseudo-relevant documents (~100 words) and score query-chunk relevance with the same reranker;&lt;/li&gt;
&lt;li&gt;Expand the query with the most relevant chunks;&lt;/li&gt;
&lt;li&gt;Rerank 1000 documents with the reranker using the expanded query.&lt;/li&gt;
&lt;/ol&gt;
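
&lt;p&gt;In compressed pseudo-Rust, the schema looks roughly like this (the &lt;code&gt;Reranker&lt;/code&gt; trait stands in for the finetuned BERT model; chunk and expansion counts are illustrative, and a real implementation would cache scores instead of recomputing them inside the sorts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Stand-in for a finetuned BERT reranker scoring (query, text) pairs.
trait Reranker {
    fn score(&amp;self, query: &amp;str, text: &amp;str) -&gt; f32;
}

/// Split a document into fixed-size word windows (~100 words in the paper).
fn chunk(doc: &amp;str, words: usize) -&gt; Vec&lt;String&gt; {
    let tokens: Vec&lt;&amp;str&gt; = doc.split_whitespace().collect();
    tokens.chunks(words).map(|w| w.join(" ")).collect()
}

fn bert_qe(query: &amp;str, docs: &amp;mut Vec&lt;String&gt;, model: &amp;dyn Reranker) -&gt; Vec&lt;String&gt; {
    // 1. Pseudo-relevance feedback: rerank; the ~10 best documents are "relevant".
    docs.sort_by(|a, b| model.score(query, b).total_cmp(&amp;model.score(query, a)));

    // 2. Chunk the feedback documents and score each chunk against the query.
    let mut chunks: Vec&lt;(f32, String)&gt; = docs.iter().take(10)
        .flat_map(|d| chunk(d, 100))
        .map(|c| (model.score(query, &amp;c), c))
        .collect();
    chunks.sort_by(|a, b| b.0.total_cmp(&amp;a.0));

    // 3. Expand the query with the most relevant chunks (count is a hyperparameter).
    let top: Vec&lt;String&gt; = chunks.into_iter().take(3).map(|(_, c)| c).collect();
    let expanded = format!("{} {}", query, top.join(" "));

    // 4. Rerank the (up to 1000) candidates with the expanded query.
    docs.sort_by(|a, b| model.score(&amp;expanded, b).total_cmp(&amp;model.score(&amp;expanded, a)));
    docs.clone()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;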

&lt;p&gt;This approach significantly outperformed BM25 + RM3 baseline in experiments (+11% NDCG@20). However, it required &lt;strong&gt;11.01x&lt;/strong&gt; more computation than just using BERT for reranking, and reranking 1000 documents with BERT would take around 9 seconds alone.&lt;/p&gt;

&lt;p&gt;Query term expansion can hypothetically work for neural retrieval as well. New terms might shift the query vector closer to that of the desired document. However, &lt;a href="https://dl.acm.org/doi/10.1145/3570724" rel="noopener noreferrer"&gt;this approach isn’t guaranteed to succeed.&lt;/a&gt; Neural search depends entirely on embeddings, and how those embeddings are generated — consequently, how similar query and document vectors are — depends heavily on the model’s training.&lt;/p&gt;

&lt;p&gt;It definitely works if &lt;strong&gt;query refining is done by a model operating in the same vector space&lt;/strong&gt;, which typically requires offline training of a retriever. The goal is to extend the query encoder input to also include feedback documents, producing an adjusted query embedding. Examples include &lt;code&gt;ANCE-PRF&lt;/code&gt; and &lt;code&gt;ColBERT-PRF&lt;/code&gt; – fine-tuned extensions of ANCE and ColBERT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjmlpwewqttcsy87kyfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjmlpwewqttcsy87kyfy.png" alt="Generating a new relevance-aware query vector" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Generating a new relevance-aware query vector&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason you’re most probably not familiar with these models – their absence from industry – is that their &lt;strong&gt;training&lt;/strong&gt; itself is a &lt;strong&gt;high upfront cost&lt;/strong&gt;, and even after that cost has been “paid”, these models &lt;a href="https://arxiv.org/abs/2108.13454" rel="noopener noreferrer"&gt;struggle with generalization&lt;/a&gt;, performing poorly on out-of-domain tasks (datasets they haven’t seen during training). Additionally, feeding an attention-based model a lengthy input (query + documents) is bad practice in production settings (attention is quadratic in the input length), where time and money are crucial decision factors.&lt;/p&gt;

&lt;p&gt;Alternatively, one could skip a step — and work directly with vectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query As Vector
&lt;/h2&gt;

&lt;p&gt;Instead of modifying the initial query, a more scalable approach is to directly adjust the query vector. It is easily applicable across modalities and suitable for both lexical and neural retrieval.&lt;/p&gt;

&lt;p&gt;Although vector search has become a trend in recent years, its core principles have existed in the field for decades. For example, the SMART retrieval system used by &lt;a href="https://sigir.org/files/museum/pub-08/XXIII-1.pdf" rel="noopener noreferrer"&gt;Rocchio&lt;/a&gt; in 1965 for his relevance feedback experiments operated on bag-of-words vector representations of text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftclzn0909awrahye20sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftclzn0909awrahye20sx.png" alt="Rocchio’s Relevance Feedback Method" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Rocchio’s Relevance Feedback Method&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rocchio’s idea&lt;/strong&gt; — to update the query vector by adding a difference between the centroids of relevant and non-relevant documents — seems to translate well to modern dual encoders-based dense retrieval systems. Researchers seem to agree: a study from 2022 demonstrated that the &lt;a href="https://arxiv.org/pdf/2108.11044" rel="noopener noreferrer"&gt;parametrized version of Rocchio&lt;/a&gt;’s method in dense retrieval consistently improves Recall@1000 by 1–5%, while keeping query processing time suitable for production — around 170 ms.&lt;/p&gt;

&lt;p&gt;However, the parameters (centroid and query weights) in the dense retrieval version of Rocchio’s method must be tuned for each dataset and, ideally, also for each request.&lt;/p&gt;
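
&lt;p&gt;For intuition, the update itself is tiny; it is the weights that are hard to pick. A Rust sketch, where &lt;code&gt;alpha&lt;/code&gt;, &lt;code&gt;beta&lt;/code&gt;, and &lt;code&gt;gamma&lt;/code&gt; are exactly the parameters that need tuning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Rocchio update: move the query toward the centroid of relevant documents
/// and away from the centroid of non-relevant ones. Assumes non-empty sets.
fn rocchio(
    query: &amp;[f32],
    relevant: &amp;[Vec&lt;f32&gt;],
    non_relevant: &amp;[Vec&lt;f32&gt;],
    alpha: f32, beta: f32, gamma: f32,
) -&gt; Vec&lt;f32&gt; {
    let dim = query.len();
    let centroid = |docs: &amp;[Vec&lt;f32&gt;]| -&gt; Vec&lt;f32&gt; {
        let mut c = vec![0.0; dim];
        for d in docs {
            for (ci, di) in c.iter_mut().zip(d) {
                *ci += di / docs.len() as f32;
            }
        }
        c
    };
    let (rel, non) = (centroid(relevant), centroid(non_relevant));
    (0..dim).map(|i| alpha * query[i] + beta * rel[i] - gamma * non[i]).collect()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;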

&lt;h2&gt;
  
  
  Gradient Descent-Based Methods
&lt;/h2&gt;

&lt;p&gt;An efficient way of doing so on the fly remained an open question until the introduction of a gradient descent-based generalization of Rocchio’s method: &lt;code&gt;Test-Time Optimization of Query Representations (TOUR)&lt;/code&gt;. TOUR adapts a query vector over multiple iterations of retrieval and reranking (retrieve → rerank → gradient descent step), guided by a reranker’s relevance judgments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uahnekmzusie7i2euek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uahnekmzusie7i2euek.png" alt="An overview of TOUR iteratively optimizing initial query representation based on pseudo relevance feedback." width="546" height="537"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure adapted from Sung et al., 2023,&lt;/em&gt; &lt;em&gt;&lt;a href="https://arxiv.org/pdf/2205.12680" rel="noopener noreferrer"&gt;Optimizing Test-Time Query Representations for Dense Retrieval&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next iteration of gradient-based query refinement methods, &lt;code&gt;ReFit&lt;/code&gt;, was proposed in 2024 as a lighter, production-friendly alternative to TOUR, limiting the retrieve → rerank → gradient descent sequence to a single iteration. The retriever’s query vector is updated by matching (via &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence" rel="noopener noreferrer"&gt;Kullback–Leibler divergence&lt;/a&gt;) the retriever’s and the cross-encoder’s similarity score distributions over feedback documents. ReFit is model- and language-independent and stably improves the Recall@100 metric by 2–3%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpr22shawr4r1y0t0mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpr22shawr4r1y0t0mn.png" alt="An overview of ReFit, a gradient-based method for query refinement" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;
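
&lt;p&gt;The core of the update is compact. Assuming dot-product scoring, the gradient of the KL divergence with respect to the query vector has a closed form, so a single ReFit-style step can be sketched as follows (our illustration of the idea, not the authors’ code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn softmax(scores: &amp;[f32]) -&gt; Vec&lt;f32&gt; {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec&lt;f32&gt; = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// One gradient step pulling the retriever's score distribution (softmax over
/// dot products) toward the cross-encoder's distribution on feedback docs:
/// d KL(r || p) / d query = sum_i (p_i - r_i) * doc_i.
fn refit_step(query: &amp;mut [f32], docs: &amp;[Vec&lt;f32&gt;], ce_scores: &amp;[f32], lr: f32) {
    let scores: Vec&lt;f32&gt; = docs.iter()
        .map(|d| query.iter().zip(d).map(|(q, x)| q * x).sum())
        .collect();
    let p = softmax(&amp;scores);   // retriever distribution
    let r = softmax(ce_scores); // cross-encoder (teacher) distribution
    for (i, d) in docs.iter().enumerate() {
        for (qj, dj) in query.iter_mut().zip(d) {
            *qj -= lr * (p[i] - r[i]) * dj;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;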

&lt;p&gt;Gradient descent-based methods seem like a production-viable option, an alternative to finetuning the retriever (distilling it from a reranker). Indeed, they don’t require training in advance and are compatible with any reranking model.&lt;/p&gt;

&lt;p&gt;However, a few limitations baked into these methods prevented a broader adoption in the industry.&lt;/p&gt;

&lt;p&gt;The gradient descent-based methods modify elements of the query vector as if they were model parameters; therefore, they require a substantial number of feedback documents to converge to a stable solution.&lt;/p&gt;

&lt;p&gt;On top of that, the gradient descent-based methods are sensitive to the choice of hyperparameters, which can lead to query drift, where the refined query moves entirely away from the user’s intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Similarity Scoring
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rlhnc1gci2yjir47irb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rlhnc1gci2yjir47irb.png" alt="Incorporating Relevance Feedback in Similarity Scoring" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another family of approaches is built around the idea of incorporating relevance feedback directly into the similarity scoring function. It might be desirable in cases where we want to preserve the original query intent, but still adjust the similarity score based on relevance feedback.&lt;/p&gt;

&lt;p&gt;In lexical retrieval, this can be as simple as boosting documents that share more terms with those judged as relevant.&lt;/p&gt;

&lt;p&gt;Its neural search counterpart is a &lt;code&gt;k-nearest neighbors-based method&lt;/code&gt; that adjusts the query-document similarity score by adding the sum of similarities between the candidate document and all known relevant examples. This technique yields a significant improvement, around 5.6 percentage points in NDCG@20, but it requires a substantial number of explicitly labelled feedback documents to be effective.&lt;/p&gt;
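
&lt;p&gt;In sketch form, with dot-product similarity and a weighting factor &lt;code&gt;lambda&lt;/code&gt; as our assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Adjusted score = original query-document similarity plus the summed
/// similarity of the candidate to every known relevant example.
fn knn_adjusted_score(
    query_doc_sim: f32,
    doc: &amp;[f32],
    relevant: &amp;[Vec&lt;f32&gt;],
    lambda: f32,
) -&gt; f32 {
    let dot = |a: &amp;[f32], b: &amp;[f32]| -&gt; f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    };
    query_doc_sim + lambda * relevant.iter().map(|r| dot(doc, r)).sum::&lt;f32&gt;()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;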

&lt;p&gt;In experiments, the knn-based method is treated as a reranker. In all other papers, we also found that adjusting similarity scores based on relevance feedback is centred around &lt;a href="https://qdrant.tech/documentation/search-precision/reranking-semantic-search/" rel="noopener noreferrer"&gt;reranking&lt;/a&gt; – &lt;strong&gt;training or finetuning rerankers to become relevance feedback&lt;/strong&gt;-aware. Typically, experiments include cross-encoders, though &lt;a href="https://arxiv.org/pdf/1904.08861" rel="noopener noreferrer"&gt;simple classifiers are also an option&lt;/a&gt;. These methods generally involve rescoring a broader set of documents retrieved during an initial search, guided by feedback from a smaller top-ranked subset. It is not a similarity matching function adjustment per se but rather a similarity scoring model adjustment.&lt;/p&gt;

&lt;p&gt;Methods typically fall into two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training rerankers offline&lt;/strong&gt; to ingest relevance feedback as an additional input at inference time, &lt;a href="https://aclanthology.org/D18-1478.pdf" rel="noopener noreferrer"&gt;as here&lt;/a&gt; — again, attention-based models and lengthy inputs: a production-deadly combination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Finetuning rerankers&lt;/strong&gt; on relevance feedback from the first retrieval stage, &lt;a href="https://aclanthology.org/2022.emnlp-main.614.pdf" rel="noopener noreferrer"&gt;as Baumgärtner et al. did&lt;/a&gt;, finetuning bias parameters of a small cross-encoder per query on 2,000 feedback documents (still a lot, especially considering that their method works with explicit feedback).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The biggest limitation here is that these reranker-based methods cannot retrieve relevant documents beyond those returned in the initial search, and using rerankers on thousands of documents in production is a no-go – it’s too expensive. Ideally, to avoid that, a similarity scoring function updated with relevance feedback should be used directly in the second retrieval iteration. However, in every research paper we’ve come across, retrieval systems are &lt;strong&gt;treated as black boxes&lt;/strong&gt; — ingesting queries, returning results, and offering no built-in mechanism to modify scoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what are the takeaways?
&lt;/h2&gt;

&lt;p&gt;Pseudo-Relevance Feedback (PRF) is known to improve the effectiveness of lexical retrievers. Several PRF-based approaches – mainly based on query term expansion – have been successfully integrated into traditional retrieval systems. At the same time, there are &lt;strong&gt;no known industry-adopted analogues in dedicated neural (vector) search solutions&lt;/strong&gt;; neural search-compatible methods remain stuck in research papers.&lt;/p&gt;

&lt;p&gt;The gap we noticed while studying the field is that researchers have &lt;strong&gt;no direct access to retrieval systems&lt;/strong&gt;, forcing them to design wrappers around the black-box-like retrieval oracles. This is sufficient for query-adjusting methods but not for similarity scoring function adjustment.&lt;/p&gt;

&lt;p&gt;Perhaps relevance feedback methods haven’t made it into the neural search systems for trivial reasons — like no one having the time to find the right balance between cost and efficiency.&lt;/p&gt;

&lt;p&gt;Getting it to work in a production setting means experimenting, building interfaces, and adapting architectures. Simply put, it needs to look worth it. And unlike 2D vector math, high-dimensional vector spaces are anything but intuitive. The curse of dimensionality is real. So is query drift. Even methods that make perfect sense on paper might not work in practice.&lt;/p&gt;

&lt;p&gt;A real-world solution should be simple. Maybe just a little bit smarter than a rule-based approach, but still practical. It shouldn’t require fine-tuning thousands of parameters or feeding paragraphs of text into transformers. &lt;strong&gt;And for it to be effective, it needs to be integrated directly into the retrieval system itself.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>rag</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introducing Gridstore: Qdrant's Custom Key-Value Store</title>
      <dc:creator>Luis Cossío</dc:creator>
      <pubDate>Thu, 06 Feb 2025 12:07:39 +0000</pubDate>
      <link>https://dev.to/qdrant/introducing-gridstore-qdrants-custom-key-value-store-2li9</link>
      <guid>https://dev.to/qdrant/introducing-gridstore-qdrants-custom-key-value-store-2li9</guid>
      <description>&lt;h2&gt;
  
  
  Why We Built Our Own Storage Engine
&lt;/h2&gt;

&lt;p&gt;Databases need a place to store and retrieve data. That’s what Qdrant's &lt;a href="https://en.wikipedia.org/wiki/Key%E2%80%93value_database" rel="noopener noreferrer"&gt;&lt;strong&gt;key-value storage&lt;/strong&gt;&lt;/a&gt; does—it links keys to values.&lt;/p&gt;

&lt;p&gt;When we started building Qdrant, we needed to pick something ready for the task. So we chose &lt;a href="https://rocksdb.org" rel="noopener noreferrer"&gt;&lt;strong&gt;RocksDB&lt;/strong&gt;&lt;/a&gt; as our embedded key-value store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwriweh16glaxmgrwt8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwriweh16glaxmgrwt8m.png" alt="Image description" width="730" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over time, we ran into issues. Its architecture requires compaction (it uses an &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSMT&lt;/a&gt;), which caused random latency spikes. It handles generic keys, while we only use it for sequential IDs. Having lots of configuration options makes it versatile, but tuning it accurately was a headache. Finally, interoperating with C++ slowed us down (although we will still support it for quite some time 😭).&lt;/p&gt;

&lt;p&gt;While there are already some good options written in Rust that we could leverage, we needed something custom. Nothing out there fit our needs in the way we wanted. We didn’t require generic keys. We wanted full control over when and which data was written and flushed. Our system already has crash recovery mechanisms built in. Online compaction isn’t a priority; we already have optimizers for that. Debugging misconfigurations was not a great use of our time.&lt;/p&gt;

&lt;p&gt;So we built our own storage. As of &lt;a href="https://qdrant.tech/blog/qdrant-1.13.x/"&gt;&lt;strong&gt;Qdrant Version 1.13&lt;/strong&gt;&lt;/a&gt;, we are using Gridstore for &lt;strong&gt;payload and sparse vector storages&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg5t7nghedw9zqxlv03l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg5t7nghedw9zqxlv03l.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  In this article, you’ll learn about:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How Gridstore works&lt;/strong&gt; – a deep dive into its architecture and mechanics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why we built it this way&lt;/strong&gt; – the key design decisions that shaped it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rigorous testing&lt;/strong&gt; – how we ensured the new storage is production-ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance benchmarks&lt;/strong&gt; – official metrics that demonstrate its efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our first challenge?&lt;/strong&gt; Figuring out the best way to handle sequential keys and variable-sized data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Gridstore Architecture: Three Main Components
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng6hr3zi52diiqwdx4sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng6hr3zi52diiqwdx4sj.png" alt="Image description" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores values in fixed-sized blocks and retrieves them using a pointer-based lookup system for efficient access.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mask Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maintains a bitmask to track block usage, distinguishing between allocated and available blocks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gaps Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manages block availability at a higher level, optimizing space allocation and reuse.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  1. The Data Layer for Fast Retrieval
&lt;/h3&gt;

&lt;p&gt;At the core of Gridstore is &lt;strong&gt;The Data Layer&lt;/strong&gt;, which is designed to store and retrieve values quickly based on their keys. This layer allows us to do efficient reads and lets us store variable-sized data. The two main components of this layer are &lt;strong&gt;The Tracker&lt;/strong&gt; and &lt;strong&gt;The Data Grid&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since internal IDs are always sequential integers (0, 1, 2, 3, 4, ...), the tracker is an array of pointers, where each pointer tells the system exactly where a value starts and how long it is. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko72b7r8luo39rj2apfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko72b7r8luo39rj2apfo.png" alt="Image description" width="800" height="271"&gt;&lt;/a&gt;&lt;br&gt;
The Data Layer uses an array of pointers to quickly retrieve data.&lt;/p&gt;

&lt;p&gt;This makes lookups incredibly fast. For example, finding key 3 is just a matter of jumping to position 3 in the tracker and following the pointer to the value in the data grid. &lt;/p&gt;

&lt;p&gt;However, because values are of variable size, the data itself is stored separately in a grid of fixed-sized blocks, which are grouped into larger page files. The fixed size of each block is usually 128 bytes. When inserting a value, Gridstore allocates one or more consecutive blocks to store it, ensuring that each block only holds data from a single value.&lt;/p&gt;
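
&lt;p&gt;In illustrative Rust (the names are ours, not Gridstore’s actual types):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;const BLOCK_SIZE: usize = 128;

/// One tracker slot: where the value starts and how long it is.
struct ValuePointer {
    page_id: u32,
    block_offset: u32,
    length: u32,
}

/// Lookup is just an array index, since internal IDs are sequential.
fn lookup(tracker: &amp;[Option&lt;ValuePointer&gt;], key: usize) -&gt; Option&lt;&amp;ValuePointer&gt; {
    tracker.get(key)?.as_ref()
}

/// Consecutive blocks a value occupies; each block holds data of one value only.
fn blocks_needed(value_len: usize) -&gt; usize {
    (value_len + BLOCK_SIZE - 1) / BLOCK_SIZE
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;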
&lt;h3&gt;
  
  
  2. The Mask Layer Reuses Space
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Mask Layer&lt;/strong&gt; helps Gridstore handle updates and deletions without the need for expensive data compaction. Instead of maintaining complex metadata for each block, Gridstore tracks usage with a bitmask, where each bit represents a block, with 1 for used, 0 for free.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2hhaftlk0xddax0lcwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2hhaftlk0xddax0lcwu.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;br&gt;
The bitmask efficiently tracks block usage.&lt;/p&gt;

&lt;p&gt;This makes it easy to determine where new values can be written. When a value is removed, it gets soft-deleted at its pointer, and the corresponding blocks in the bitmask are marked as available. Similarly, when updating a value, the new version is written elsewhere, and the old blocks are freed in the bitmask.&lt;/p&gt;

&lt;p&gt;This approach ensures that Gridstore doesn’t waste space. As the storage grows, however, scanning the entire bitmask for available blocks can become computationally expensive.&lt;/p&gt;
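
&lt;p&gt;A minimal version of such a bitmask might look like this (again illustrative, not the actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// One bit per block: 1 = used, 0 = free.
struct BlockMask {
    bits: Vec&lt;u64&gt;,
}

impl BlockMask {
    fn set(&amp;mut self, block: usize, used: bool) {
        let (word, bit) = (block / 64, block % 64);
        if used {
            self.bits[word] |= 1u64 &lt;&lt; bit;
        } else {
            self.bits[word] &amp;= !(1u64 &lt;&lt; bit);
        }
    }

    fn is_used(&amp;self, block: usize) -&gt; bool {
        (self.bits[block / 64] &gt;&gt; (block % 64)) &amp; 1 == 1
    }

    /// Soft deletion: freeing a value is just clearing its blocks' bits.
    fn free_range(&amp;mut self, start: usize, count: usize) {
        for b in start..start + count {
            self.set(b, false);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;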
&lt;h3&gt;
  
  
  3. The Gaps Layer for Effective Updates
&lt;/h3&gt;

&lt;p&gt;To further optimize update handling, Gridstore introduces &lt;strong&gt;The Gaps Layer&lt;/strong&gt;, which provides a higher-level view of block availability. &lt;br&gt;
Instead of scanning the entire bitmask, Gridstore splits the bitmask into regions and keeps track of the largest contiguous free space within each region, known as &lt;strong&gt;The Region Gap&lt;/strong&gt;. By also storing the leading and trailing gaps of each region, the system can efficiently combine multiple regions when needed for storing large values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wski880wmo2ed91syno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wski880wmo2ed91syno.png" alt="Image description" width="800" height="563"&gt;&lt;/a&gt;&lt;br&gt;
The complete architecture of Gridstore&lt;/p&gt;

&lt;p&gt;This layered approach allows Gridstore to locate available space quickly, scaling down the work required for scans while keeping memory overhead minimal. With this system, finding storage space for new values requires scanning only a tiny fraction of the total metadata, making updates and insertions highly efficient, even in large segments.&lt;/p&gt;

&lt;p&gt;Given the default configuration, the gaps layer takes up roughly a millionth of the actual storage size: for every 1GB of data, it only requires scanning 6KB of metadata. With this mechanism, the other operations can be executed in virtually constant-time complexity.&lt;/p&gt;
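
&lt;p&gt;Conceptually, each region summary needs only a few numbers (an illustrative sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Per-region summary kept by the gaps layer. `max` is the largest run of
/// free blocks inside the region; `leading` and `trailing` allow combining
/// adjacent regions when a value needs a gap spanning region boundaries.
struct RegionGaps {
    leading: u16,
    trailing: u16,
    max: u16,
}

/// Find a region that can hold `blocks` on its own, scanning only the
/// small summaries instead of the full bitmask.
fn find_region(gaps: &amp;[RegionGaps], blocks: u16) -&gt; Option&lt;usize&gt; {
    gaps.iter().position(|g| g.max &gt;= blocks)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;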
&lt;h2&gt;
  
  
  Gridstore in Production: Maintaining Data Integrity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahj36tq3wy0484431sdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahj36tq3wy0484431sdx.png" alt="Image description" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gridstore’s architecture introduces multiple interdependent structures that must remain in sync to ensure data integrity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Data Layer&lt;/strong&gt; holds the data and associates each key with its location in storage, including page ID, block offset, and the size of its value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mask Layer&lt;/strong&gt; keeps track of which blocks are occupied and which are free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Gaps Layer&lt;/strong&gt; provides an indexed view of free blocks for efficient space allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time a new value is inserted or an existing value is updated, all these components need to be modified in a coordinated way.&lt;/p&gt;
&lt;h3&gt;
  
  
  When Things Break in Real Life
&lt;/h3&gt;

&lt;p&gt;Real-world systems don’t operate in a vacuum. Failures happen: software bugs cause unexpected crashes, memory exhaustion forces processes to terminate, disks fail to persist data reliably, and power losses can interrupt operations at any moment. &lt;br&gt;
&lt;em&gt;The critical question is: what happens if a failure occurs while updating these structures?&lt;/em&gt;&lt;br&gt;
If one component is updated but another isn’t, the entire system could become inconsistent. Worse, if an operation is only partially written to disk, it could lead to orphaned data, unusable space, or even data corruption.&lt;/p&gt;
&lt;h3&gt;
  
  
  Stability Through Idempotency: Recovering With WAL
&lt;/h3&gt;

&lt;p&gt;To guard against these risks, Qdrant relies on a &lt;a href="https://qdrant.tech/documentation/concepts/storage/"&gt;&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/a&gt;. Before committing an operation, Qdrant ensures that it is at least recorded in the WAL. If a crash happens before all updates are flushed, the system can safely replay operations from the log. &lt;/p&gt;

&lt;p&gt;This recovery mechanism introduces another essential property: &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;&lt;strong&gt;idempotence&lt;/strong&gt;&lt;/a&gt;. &lt;br&gt;
The storage system must be designed so that reapplying the same operation after a failure leads to the same final state as if the operation had been applied just once.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Grand Solution: Lazy Updates
&lt;/h3&gt;

&lt;p&gt;To achieve this, &lt;strong&gt;Gridstore completes updates lazily&lt;/strong&gt;, prioritizing the most critical part of the write: the data itself. &lt;/p&gt;

&lt;p&gt;👉 Instead of immediately updating all metadata structures, it writes the new value first while keeping lightweight pending changes in a buffer. &lt;br&gt;
👉 The system only finalizes these updates when explicitly requested, ensuring that a crash never results in marking data as deleted before the update has been safely persisted. &lt;br&gt;
👉 In the worst-case scenario, Gridstore may need to write the same data twice, leading to a minor space overhead, but it will never corrupt the storage by overwriting valid data. &lt;/p&gt;
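
&lt;p&gt;In pseudo-Rust, the write path splits into a durable part and a deferred part (the &lt;code&gt;Store&lt;/code&gt; trait and the types here are our simplification, not Gridstore’s API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;struct ValuePtr {
    start: usize,
    count: usize,
}

/// Simplified view of the storage operations the write path needs.
trait Store {
    /// Append the value into free blocks and return where it landed.
    fn write_blocks(&amp;mut self, value: &amp;[u8]) -&gt; ValuePtr;
    /// Current pointer for the key, if any.
    fn pointer(&amp;self, key: usize) -&gt; Option&lt;ValuePtr&gt;;
}

enum PendingChange {
    FreeBlocks { start: usize, count: usize },
    SetPointer { key: usize, ptr: ValuePtr },
}

/// Lazy update: persist the value first; defer every metadata change.
/// A crash before the flush can at worst duplicate data on disk; it can
/// never mark live data as deleted.
fn put(store: &amp;mut dyn Store, pending: &amp;mut Vec&lt;PendingChange&gt;, key: usize, value: &amp;[u8]) {
    let new_ptr = store.write_blocks(value); // the critical part, written first
    if let Some(old) = store.pointer(key) {
        // Old blocks are only *marked* for freeing; applied on explicit flush.
        pending.push(PendingChange::FreeBlocks { start: old.start, count: old.count });
    }
    pending.push(PendingChange::SetPointer { key, ptr: new_ptr });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;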
&lt;h2&gt;
  
  
  How We Tested the Final Product
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy8rcpa29k45ja0d5u8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy8rcpa29k45ja0d5u8n.png" alt="Image description" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  First... Model Testing
&lt;/h3&gt;

&lt;p&gt;Gridstore can be tested efficiently using model testing, which compares its behavior to a simple in-memory hash map. Since Gridstore should function like a persisted hash map, this method quickly detects inconsistencies.&lt;/p&gt;

&lt;p&gt;The process is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialize a Gridstore instance and an empty hash map.&lt;/li&gt;
&lt;li&gt;Run random operations (put, delete, update) on both.&lt;/li&gt;
&lt;li&gt;Verify that results match after each operation.&lt;/li&gt;
&lt;li&gt;Compare all keys and values to ensure consistency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach provides high test coverage, exposing issues like incorrect persistence or faulty deletions. Running large-scale model tests ensures Gridstore remains reliable in real-world use.&lt;/p&gt;

&lt;p&gt;Here is a naive way to generate operations in Rust.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Assumes `PointOffset`, `Payload`, and `random_payload` come from the test harness.&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Rng&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_point_offset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;point_offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..=&lt;/span&gt;&lt;span class="n"&gt;max_point_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.gen_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;unreachable!&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
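
&lt;p&gt;A model-test loop built on this generator is equally small. A sketch, where the &lt;code&gt;KvStore&lt;/code&gt; trait is our simplification of the storage interface, and &lt;code&gt;PointOffset&lt;/code&gt;/&lt;code&gt;Payload&lt;/code&gt; are assumed to implement the usual &lt;code&gt;Copy&lt;/code&gt;, &lt;code&gt;Eq&lt;/code&gt;, &lt;code&gt;Hash&lt;/code&gt;, &lt;code&gt;PartialEq&lt;/code&gt;, and &lt;code&gt;Debug&lt;/code&gt; traits:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;
use rand::Rng;

/// Minimal interface the model test needs; Gridstore's real API differs.
trait KvStore {
    fn put(&amp;mut self, key: PointOffset, value: &amp;Payload);
    fn delete(&amp;mut self, key: PointOffset);
    fn get(&amp;self, key: PointOffset) -&gt; Option&lt;Payload&gt;;
}

fn model_test(store: &amp;mut impl KvStore, rng: &amp;mut impl Rng, steps: u32) {
    let mut model: HashMap&lt;PointOffset, Payload&gt; = HashMap::new();
    for _ in 0..steps {
        // Apply the same random operation to the store and the reference model.
        match Operation::random(rng, 1_000) {
            Operation::Put(k, v) | Operation::Update(k, v) =&gt; {
                store.put(k, &amp;v);
                model.insert(k, v);
            }
            Operation::Delete(k) =&gt; {
                store.delete(k);
                model.remove(&amp;k);
            }
        }
    }
    // Compare all keys and values (the steps above verify after each
    // operation; we check once at the end for brevity).
    for (k, v) in &amp;model {
        assert_eq!(store.get(*k).as_ref(), Some(v));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;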



&lt;p&gt;Model testing is a high-value way to catch bugs, especially when your system mimics a well-defined component like a hash map. If your component behaves the same as another one, using model testing brings a lot of value for a bit of effort.&lt;/p&gt;

&lt;p&gt;We could have tested against RocksDB, but simplicity matters more. A simple hash map lets us run massive test sequences quickly, exposing issues faster.&lt;/p&gt;

&lt;p&gt;For even sharper debugging, Property-Based Testing adds automated test generation and shrinking. It pinpoints failures with minimized test cases, making bug hunting faster and more effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash Testing: Can Gridstore Handle the Pressure?
&lt;/h3&gt;

&lt;p&gt;Designing for crash resilience is one thing, and proving it works under stress is another. To push Qdrant’s data integrity to the limit, we built &lt;a href="https://github.com/qdrant/crasher" rel="noopener noreferrer"&gt;&lt;strong&gt;Crasher&lt;/strong&gt;&lt;/a&gt;, a test bench that brutally kills and restarts Qdrant while it handles a heavy update workload.&lt;/p&gt;

&lt;p&gt;Crasher runs a loop that continuously writes data, then randomly crashes Qdrant. On each restart, Qdrant replays its &lt;a href="https://qdrant.tech/documentation/concepts/storage/"&gt;&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/a&gt;, and we verify whether data integrity holds. Possible anomalies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing data (points, vectors, or payloads)&lt;/li&gt;
&lt;li&gt;Corrupt payload values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aggressive yet simple approach has uncovered real-world issues when run for extended periods. While we also use chaos testing for distributed setups, Crasher excels at fast, repeatable failure testing in a local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Gridstore Performance: Benchmarks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83z3krx9wuz8xosjmctg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83z3krx9wuz8xosjmctg.png" alt="Image description" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To measure the impact of our new storage engine, we used &lt;a href="https://github.com/jonhoo/bustle" rel="noopener noreferrer"&gt;&lt;strong&gt;Bustle, a key-value storage benchmarking framework&lt;/strong&gt;&lt;/a&gt;, to compare Gridstore against RocksDB. We tested three workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Type&lt;/th&gt;
&lt;th&gt;Operation Distribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95% reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Insert-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80% inserts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50% updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  The results speak for themselves:
&lt;/h4&gt;

&lt;p&gt;Average latency for all kinds of workloads is lower across the board, particularly for inserts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ww9g05p05zru1jaqb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ww9g05p05zru1jaqb5.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This shows a clear boost in performance. As we can see, the investment in Gridstore is paying off.&lt;/p&gt;

&lt;h3&gt;
  
  
  End-to-End Benchmarking
&lt;/h3&gt;

&lt;p&gt;Now, let’s test the impact on a real Qdrant instance. So far, we’ve only integrated Gridstore for &lt;a href="https://qdrant.tech/documentation/concepts/payload/"&gt;&lt;strong&gt;payloads&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://qdrant.tech/documentation/concepts/vectors/#sparse-vectors"&gt;&lt;strong&gt;sparse vectors&lt;/strong&gt;&lt;/a&gt;, but even this partial switch should show noticeable improvements.&lt;/p&gt;

&lt;p&gt;For benchmarking, we used our in-house &lt;a href="https://github.com/qdrant/bfb" rel="noopener noreferrer"&gt;&lt;strong&gt;bfb tool&lt;/strong&gt;&lt;/a&gt; to generate a workload. Our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;bfb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--max-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--sparse-vectors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--set-payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--on-disk-payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--dim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--sparse-dim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--bool-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--keywords&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--float-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--int-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--text-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--text-payload-length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--skip-field-indices&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--jsonl-updates&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;./rps.jsonl&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This benchmark upserts 1 million points twice. Each point has: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A medium to large payload&lt;/li&gt;
&lt;li&gt;A tiny dense vector (dense vectors use a different storage type)&lt;/li&gt;
&lt;li&gt;A sparse vector&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Additional configuration:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;The test updated payload data separately, in another request.&lt;/li&gt;
&lt;li&gt;There were no payload indices, which ensured we measured pure ingestion speed.&lt;/li&gt;
&lt;li&gt;Finally, we gathered request latency metrics for analysis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We ran this against Qdrant 1.12.6, toggling between the old and new storage backends. &lt;/p&gt;

&lt;h3&gt;
  
  
  Final Result
&lt;/h3&gt;

&lt;p&gt;Data ingestion is &lt;strong&gt;twice as fast, with smoother throughput&lt;/strong&gt; — a massive win! 😍&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxb9k7bspj5ne9895c7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxb9k7bspj5ne9895c7g.png" alt="Image description" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We optimized for speed, and it paid off—but what about storage size?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gridstore: 2333MB&lt;/li&gt;
&lt;li&gt;RocksDB: 2319MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strictly speaking, RocksDB is slightly smaller, but the difference is negligible compared to the 2x faster ingestion and more stable throughput. A small trade-off for a big performance gain!&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying Out Gridstore
&lt;/h2&gt;

&lt;p&gt;Gridstore represents a significant advancement in how Qdrant manages its &lt;strong&gt;key-value storage&lt;/strong&gt; needs. It offers great performance and streamlined updates tailored specifically for our use case. We have managed to achieve faster, more reliable data ingestion while maintaining data integrity, even under heavy workloads and unexpected failures. It is already used as a storage backend for on-disk payloads and sparse vectors.&lt;/p&gt;

&lt;p&gt;👉 It’s important to note that Gridstore remains tightly integrated with Qdrant and, as such, has not been released as a standalone crate. &lt;br&gt;
Its API is still evolving, and we are focused on refining it within our ecosystem to ensure maximum stability and performance. That said, we recognize the value this innovation could bring to the wider Rust community. In the future, once the API stabilizes and we decouple it enough from Qdrant, we will consider publishing it as a contribution to the community ❤️.&lt;/p&gt;

&lt;p&gt;For now, Gridstore continues to drive improvements in Qdrant, demonstrating the benefits of a custom-tailored storage engine designed with modern demands in mind. Stay tuned for further updates and potential community releases as we keep pushing the boundaries of performance and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwbbltk1kbymxiqexsku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwbbltk1kbymxiqexsku.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;br&gt;
Simple, efficient, and designed just for Qdrant.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>What is Agentic RAG? Building Agents with Qdrant</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Mon, 25 Nov 2024 18:19:49 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-agentic-rag-building-agents-with-qdrant-2pm0</link>
      <guid>https://dev.to/qdrant/what-is-agentic-rag-building-agents-with-qdrant-2pm0</guid>
      <description>&lt;p&gt;Standard &lt;a href="https://dev.to/articles/what-is-rag-in-ai/"&gt;Retrieval Augmented Generation&lt;/a&gt; follows a predictable, linear path: receive a query, retrieve relevant documents, and generate a response. In many cases that might be enough to solve a particular problem. In the worst case scenario, your LLM will just decide to not answer the question, because the context does not provide enough information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryfoc1hqtjy5u2lkplot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryfoc1hqtjy5u2lkplot.png" alt="Standard, linear RAG pipeline" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, we have agents. These systems are given more freedom to act, and can take multiple non-linear steps to achieve a certain goal. There isn't a single definition of what an agent is, but in general, it is an application that uses an LLM and usually some tools to communicate with the outside world. &lt;/p&gt;

&lt;p&gt;LLMs are used as decision-makers which decide what action to take next. Actions can be anything, but they are usually well-defined and limited to a certain set of possibilities. One of these actions might be to query a vector database, like Qdrant, to retrieve relevant documents, if the context is not enough to make a decision. &lt;/p&gt;

&lt;p&gt;However, RAG is just a single tool in the agent's arsenal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1acyiwpo6m8csvlajuy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1acyiwpo6m8csvlajuy1.png" alt="AI Agent" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic RAG: Combining RAG with Agents
&lt;/h2&gt;

&lt;p&gt;Since the agent definition is vague, the concept of &lt;strong&gt;Agentic RAG&lt;/strong&gt; is also not well-defined. In general, it refers to the combination of RAG with agents. This allows the agent to use external knowledge sources to make decisions, and primarily to decide when the external knowledge is needed. &lt;/p&gt;

&lt;p&gt;We can describe a system as Agentic RAG if it breaks the linear flow of a standard RAG system, and gives the agent the ability to take multiple steps to achieve a goal.&lt;/p&gt;

&lt;p&gt;A simple router that chooses a path to follow is often described as the simplest form of an agent. Such a system has multiple paths with conditions describing when to take a certain path. In the context of Agentic RAG, the agent can decide to query a vector database if the context is not enough to answer, or skip the query if the context suffices or the question refers to common knowledge. &lt;/p&gt;

&lt;p&gt;Alternatively, there might be multiple collections storing different kinds of information, and the agent can decide which collection to query based on the context. The key factor is that the decision of choosing a path is made by the LLM, which is the core of the agent. A routing agent never comes back to the previous step, so it's ultimately just a conditional decision-making system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5b7o9qmvo1a8wy6rvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5b7o9qmvo1a8wy6rvh.png" alt="Routing Agent" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;
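&lt;p&gt;To make the routing idea concrete, here is a minimal, framework-free sketch. The helpers (&lt;code&gt;llm_choose&lt;/code&gt;, &lt;code&gt;search_qdrant&lt;/code&gt;, &lt;code&gt;generate_answer&lt;/code&gt;) are hypothetical placeholders, not part of any specific library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def llm_choose(question: str, options: list[str]) -&amp;gt; str:
    """Hypothetical: ask an LLM to pick exactly one of the given labels."""
    ...

def search_qdrant(question: str) -&amp;gt; list[str]:
    """Hypothetical: run a semantic search and return matching documents."""
    ...

def generate_answer(question: str, context: list[str] | None) -&amp;gt; str:
    """Hypothetical: ask an LLM to answer, optionally grounded in context."""
    ...

def route(question: str) -&amp;gt; str:
    # The LLM only picks a path; a routing agent never revisits earlier steps.
    decision = llm_choose(question, options=["vector_search", "common_knowledge"])
    if decision == "vector_search":
        return generate_answer(question, context=search_qdrant(question))
    return generate_answer(question, context=None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;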

&lt;p&gt;However, routing is just the beginning. Agents can be much more complex, and extreme forms of agents can have complete freedom to act. In such cases, the agent is given a set of tools and can autonomously decide which ones to use, how to use them, and in which order. LLMs are asked to plan and execute actions, and the agent can take multiple steps to achieve a goal, including taking steps back if needed. Such a system does not have to follow a DAG (Directed Acyclic Graph) structure, and can have loops that help to self-correct the decisions made in the past. &lt;/p&gt;

&lt;p&gt;An agentic RAG system built in that manner can have tools not only to query a vector database, but also to play with the query, summarize the results, or even generate new data to answer the question. Options are endless, but there are some common patterns that can be observed in the wild. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie282zphjgn5nffkgkyn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie282zphjgn5nffkgkyn.png" alt="Autonomous Agent" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solving Information Retrieval Problems with LLMs
&lt;/h3&gt;

&lt;p&gt;Generally speaking, tools exposed in an agentic RAG system are used to solve information retrieval problems which are not new to the search community. LLMs have changed how we approach these problems, but the core of the problem remains the same. What kinds of tools can you consider using in an agentic RAG system? Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Querying a vector database&lt;/strong&gt; - the most common tool used in agentic RAG systems. It allows the agent to retrieve relevant documents based on the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query expansion&lt;/strong&gt; - a tool for improving the query. It can add synonyms, correct typos, or even generate new queries based on the original one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s73cptbqh6xbol73lod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s73cptbqh6xbol73lod.png" alt="Query expansion example" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extracting filters&lt;/strong&gt; - vector search alone is sometimes not enough. In many cases, you might want to narrow down the results based on specific parameters. This extraction process can automatically identify relevant conditions from the query. Otherwise, your users would have to manually define these search constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyonl9p8qivfuqwtg3bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyonl9p8qivfuqwtg3bx.png" alt="Extracting filters" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;
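&lt;p&gt;Here is a minimal sketch of that extraction step. The &lt;code&gt;llm_extract&lt;/code&gt; and &lt;code&gt;embed&lt;/code&gt; helpers, the collection, and the field names are hypothetical; the filter construction itself uses the real &lt;code&gt;qdrant-client&lt;/code&gt; models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def llm_extract(query: str) -&amp;gt; dict:
    """Hypothetical: ask an LLM to return e.g. {"color": "red", "max_price": 100.0}."""
    ...

def embed(query: str) -&amp;gt; list[float]:
    """Hypothetical: embed the query with the same model used at indexing time."""
    ...

def search_with_filters(query: str):
    constraints = llm_extract(query)
    # Map the extracted constraints onto a Qdrant payload filter.
    query_filter = models.Filter(must=[
        models.FieldCondition(key="color", match=models.MatchValue(value=constraints["color"])),
        models.FieldCondition(key="price", range=models.Range(lt=constraints["max_price"])),
    ])
    return client.query_points(
        collection_name="products",  # assumed collection
        query=embed(query),
        query_filter=query_filter,
    ).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;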

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality judgement&lt;/strong&gt; - knowing the quality of the results for a given query can be used to decide whether they are good enough to answer, or whether the agent should take another step to improve them somehow. Alternatively, the agent can admit it failed to provide a good response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpzgubn4ecxd66uvn81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpzgubn4ecxd66uvn81.png" alt="Quality judgement" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;
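&lt;p&gt;A quality-judgement step can be as simple as an LLM grading the results, with a threshold deciding what happens next. A minimal sketch, with a hypothetical &lt;code&gt;llm_judge&lt;/code&gt; helper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def llm_judge(query: str, documents: list[str]) -&amp;gt; float:
    """Hypothetical: ask an LLM to grade the results' relevance from 0.0 to 1.0."""
    ...

def good_enough(query: str, documents: list[str], threshold: float = 0.7) -&amp;gt; bool:
    # Below the threshold, the agent should take another step (e.g. reformulate
    # the query) or admit it cannot produce a good answer.
    return llm_judge(query, documents) &amp;gt;= threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;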

&lt;p&gt;These are just some examples; the list is not exhaustive. Your LLM could also play with Qdrant search parameters or choose different methods to query it. For instance, if your users search with specific keywords, you may prefer sparse vectors to dense vectors, as they are more efficient in such cases. You then have to arm your agent with tools to decide when to use sparse vectors and when to use dense vectors. An agent aware of the collection structure can make such decisions easily.&lt;/p&gt;
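&lt;p&gt;As a sketch, assuming a collection named &lt;code&gt;docs&lt;/code&gt; with named &lt;code&gt;dense&lt;/code&gt; and &lt;code&gt;sparse&lt;/code&gt; vectors, the two retrieval strategies could be exposed as separate tools for the agent to choose between:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def search_dense(query_vector: list[float]) -&amp;gt; list:
    """Semantic search - a good default for natural-language queries."""
    return client.query_points("docs", query=query_vector, using="dense").points

def search_sparse(indices: list[int], values: list[float]) -&amp;gt; list:
    """Sparse-vector search - often better for keyword-heavy queries."""
    return client.query_points(
        "docs",
        query=models.SparseVector(indices=indices, values=values),
        using="sparse",
    ).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;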

&lt;p&gt;Each of these tools might be a separate agent on its own, and multi-agent systems are not uncommon. In such cases, agents can communicate with each other, and one agent can decide to use another agent to solve a particular problem.&lt;/p&gt;

&lt;p&gt;A human in the loop is also a pretty useful component of an agentic RAG system: it can correct the agent's decisions, or steer it in the right direction.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where are Agents Used?
&lt;/h2&gt;

&lt;p&gt;Agents are an interesting concept, but since they heavily rely on LLMs, they are not applicable to all problems. Large Language Models are expensive and tend to be slow, so in many cases the cost is not worth it. Standard RAG involves just a single call to the LLM, and the response is generated in a predictable way. Agents, on the other hand, can take multiple steps, and the latency experienced by the user adds up. &lt;/p&gt;

&lt;p&gt;In many cases, it's not acceptable. Agentic RAG is probably not that widely applicable in ecommerce search, where the user expects a quick response, but might be fine for customer support, where the user is willing to wait a bit longer for a better answer.&lt;/p&gt;
&lt;h2&gt;
  
  
  Which Framework is Best?
&lt;/h2&gt;

&lt;p&gt;There are lots of frameworks available to build agents, and choosing the best one is not easy. It depends on your existing stack or the tools you are familiar with. Some of the most popular LLM libraries have already drifted towards the agent paradigm, and they are offering tools to build them. There are, however, some tools built primarily for agent development, so let's focus on them.&lt;/p&gt;
&lt;h3&gt;
  
  
  LangGraph
&lt;/h3&gt;

&lt;p&gt;Developed by the LangChain team, LangGraph seems like a natural extension for those who already use LangChain for building their RAG systems, and would like to start with agentic RAG. &lt;/p&gt;

&lt;p&gt;Surprisingly, LangGraph has nothing to do with Large Language Models on its own. It's a framework for building graph-based applications in which each &lt;strong&gt;node&lt;/strong&gt; is a step of the workflow. Each node takes an application &lt;strong&gt;state&lt;/strong&gt; as an input, and produces a modified state as an output. The state is then passed to the next node, and so on. &lt;strong&gt;Edges&lt;/strong&gt; between the nodes might be conditional, which makes branching possible. Contrary to DAG-based tools (e.g. Apache Airflow), LangGraph allows for loops in the graph, which makes it possible to implement cyclic workflows, so an agent can achieve self-reflection and self-correction. Theoretically, LangGraph can be used to build any kind of application in a graph-based manner, not only LLM agents.&lt;/p&gt;

&lt;p&gt;Some of the strengths of LangGraph include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; - the state of the workflow graph is stored as a checkpoint. That happens at each so-called super-step (which is a single sequential node of a graph). It enables replaying certain steps of the workflow, fault tolerance, and human-in-the-loop interactions. This mechanism also acts as a &lt;strong&gt;short-term memory&lt;/strong&gt;, accessible in the context of a particular workflow execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; - LangGraph also has a concept of memories that are shared between different workflow runs. However, this mechanism has to be explicitly handled by our nodes. &lt;strong&gt;Qdrant with its semantic search capabilities is often used as a long-term memory layer&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent support&lt;/strong&gt; - while there is no separate concept of multi-agent systems in LangGraph, it's possible to create such an architecture by building a graph that includes multiple agents and some kind of supervisor that decides which agent to use in a given situation. Since a node might be anything, it might be another agent as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some other interesting features of LangGraph include the ability to visualize the graph, automate the retries of failed steps, and include human-in-the-loop interactions.&lt;/p&gt;

&lt;p&gt;A minimal example of an agentic RAG could improve the user query, e.g. by fixing typos, expanding it with synonyms, or even generating a new query based on the original one. The agent could then retrieve documents from a vector database based on the improved query, and generate a response. The LangGraph app implementing this approach could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing_extensions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.constants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The state of the agent includes at least the messages exchanged between the agent(s) 
&lt;/span&gt;    &lt;span class="c1"&gt;# and the user. It is, however, possible to include other information in the state, as 
&lt;/span&gt;    &lt;span class="c1"&gt;# it depends on the specific agent.
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;improve_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# Building a graph requires defining nodes and building the flow between them with edges.
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;improve_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compiling the graph performs some checks and prepares the graph for execution.
&lt;/span&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Compiled graph might be invoked with the initial state to start.
&lt;/span&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node of the process is just a Python function that performs a certain operation. You can call an LLM of your choice inside them, if you want to, but there is no assumption about the messages being created by any AI. &lt;strong&gt;LangGraph rather acts as a runtime that launches these functions in a specific order, and passes the state between them&lt;/strong&gt;. &lt;/p&gt;
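&lt;p&gt;For instance, the &lt;code&gt;improve_query&lt;/code&gt; node left as a stub above could be filled in roughly like this, reusing the &lt;code&gt;AgentState&lt;/code&gt; from the example. It's only a sketch; the &lt;code&gt;langchain-openai&lt;/code&gt; package and the model name are assumptions, and any LLM client would do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # assumed model; any chat model works

def improve_query(state: AgentState):
    # Ask the LLM to rewrite the latest user message into a better search query.
    last_message = state["messages"][-1]
    rewritten = llm.invoke([
        HumanMessage(
            content=f"Rewrite this search query, fixing typos and adding synonyms: {last_message.content}"
        )
    ])
    # The add_messages reducer appends the rewritten query to the state.
    return {"messages": [rewritten]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;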

&lt;p&gt;While &lt;a href="https://www.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; integrates well with the LangChain ecosystem, it can be used independently. For teams looking for additional support and features, there's also a commercial offering called LangGraph Platform. The framework is available for both Python and JavaScript environments, making it possible to be used in different tech stacks.&lt;/p&gt;
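&lt;p&gt;One more sketch before moving on: the persistence mechanism described above is enabled when the graph is compiled. Assuming LangGraph's in-memory &lt;code&gt;MemorySaver&lt;/code&gt; checkpointer (a database-backed saver would replace it in production) and the &lt;code&gt;builder&lt;/code&gt; from the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langgraph.checkpoint.memory import MemorySaver

compiled_graph = builder.compile(checkpointer=MemorySaver())

# The thread_id scopes the short-term memory: invoking the graph again with
# the same id resumes from the checkpoints stored for that conversation.
compiled_graph.invoke(
    {"messages": [("user", "Why Qdrant is the best vector database out there?")]},
    config={"configurable": {"thread_id": "conversation-1"}},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;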

&lt;h3&gt;
  
  
  CrewAI
&lt;/h3&gt;

&lt;p&gt;CrewAI is another popular choice for building agents, including agentic RAG. It's a high-level framework that assumes there are some LLM-based agents working together to achieve a common goal. That's where the "crew" in CrewAI comes from. CrewAI is designed with multi-agent systems in mind. Contrary to LangGraph, the developer does not create a graph of processing, but defines agents and their roles within the crew.&lt;/p&gt;

&lt;p&gt;Some of the key concepts of CrewAI include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; - a unit that has a specific role and goal, controlled by an LLM. It can optionally use some external tools to communicate with the outside world, but it is generally steered by the prompt we provide to the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process&lt;/strong&gt; - currently either sequential or hierarchical. It defines how the task will be executed by the agents. In a sequential process, agents are executed one after another, while in a hierarchical process, an agent is selected by a manager agent, which is responsible for deciding which agent to use in a given situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roles and goals&lt;/strong&gt; - each agent has a certain role within the crew, and the goal it should aim to achieve. These are set when we define an agent and are used to make decisions about which agent to use in a given situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; - an extensive memory system consisting of short-term memory, long-term memory, entity memory, and contextual memory that combines the other three. There is also user memory for preferences and personalization. &lt;strong&gt;This is where Qdrant comes into play, as it might be used as a long-term memory layer.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrewAI provides a rich set of tools integrated into the framework. That may be a huge advantage for those who want to combine RAG with e.g. code execution or image generation. The ecosystem is rich; however, bringing your own tools is not a big deal, as CrewAI is designed to be extensible.&lt;/p&gt;

&lt;p&gt;A simple agentic RAG application implemented in CrewAI could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.entity.entity_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EntityMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.short_term.short_term_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ShortTermMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.storage.rag_storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGStorage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAGStorage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;response_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate response based on the conversation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide the best response, or admit when the response is not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a response generator agent. I generate &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;responses based on the conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query_reformulation_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reformulate the query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rewrite the query to get better results. Fix typos, grammar, word choice, etc.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a query reformulation agent. I reformulate the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query to get better results.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Let me know why Qdrant is the best vector database out there.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 bullet points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_reformulation_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entity_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EntityMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;short_term_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ShortTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short-term&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Disclaimer: QdrantStorage is not a part of the CrewAI framework, but it's taken from the Qdrant documentation on &lt;a href="https://qdrant.tech/documentation/frameworks/crewai/" rel="noopener noreferrer"&gt;how to integrate Qdrant with CrewAI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Although it's not a technical advantage, CrewAI has &lt;a href="https://docs.crewai.com/introduction" rel="noopener noreferrer"&gt;great documentation&lt;/a&gt;. The framework is available for Python, and it's easy to get started with it. CrewAI also has a commercial offering, CrewAI Enterprise, which provides a platform for building and deploying agents at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoGen
&lt;/h3&gt;

&lt;p&gt;AutoGen emphasizes multi-agent architectures as a fundamental design principle. The framework requires at least two agents in any system to really call an application agentic - typically an assistant and a user proxy exchange messages to achieve a common goal. Sequential chat with more than two agents is also supported, as well as group chat and nested chat for internal dialogue. However, AutoGen does not assume there is a structured state that is passed between the agents, and the chat conversation is the only way to communicate between them.&lt;/p&gt;

&lt;p&gt;There are many interesting concepts in the framework, some of them even quite unique:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools/functions&lt;/strong&gt; - external components that can be used by agents to communicate with the outside world. They are defined as Python callables, and can be used for any external interaction we want to allow the agent to perform. Type annotations define the input and output of the tools, and Pydantic models are supported for more complex type schemas. For the time being, AutoGen supports only the OpenAI-compatible tool-call API (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code executors&lt;/strong&gt; - built-in code executors include local command, Docker command, and Jupyter. An agent can write and launch code, so theoretically the agents can do anything that can be done in Python. None of the other frameworks makes code generation and execution that prominent. Code execution being a first-class citizen in AutoGen is an interesting concept.&lt;/li&gt;
&lt;/ul&gt;
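&lt;p&gt;As an illustration of the tools mechanism, here is a minimal sketch of registering a typed retrieval function with a pair of agents. It assumes the pyautogen 0.2.x API; &lt;code&gt;retrieve_documents&lt;/code&gt; is a hypothetical helper left as a stub:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from os import environ

from typing_extensions import Annotated
from autogen import ConversableAgent, register_function

llm_config = {"config_list": [{"model": "gpt-4", "api_key": environ.get("OPENAI_API_KEY")}]}

assistant = ConversableAgent(name="assistant", llm_config=llm_config)
user_proxy = ConversableAgent(name="user_proxy", llm_config=False, human_input_mode="NEVER")

def retrieve_documents(
    query: Annotated[str, "The search query to run against the knowledge base"],
) -&amp;gt; list[str]:
    """Hypothetical: perform a Qdrant search and return the matching chunks."""
    ...

# The caller proposes tool calls; the executor actually runs the Python function.
register_function(
    retrieve_documents,
    caller=assistant,
    executor=user_proxy,
    description="Retrieve documents relevant to the query.",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;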

&lt;p&gt;Each AutoGen agent uses at least one of the components: human-in-the-loop, code executor, tool executor, or LLM. A simple agentic RAG, based on a conversation between two agents that can retrieve documents from a vector database or improve the query, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversableAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen.agentchat.contrib.retrieve_user_proxy_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrieveUserProxyAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="n"&gt;response_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_generator_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You answer user questions based solely on the provided context. You ask to retrieve relevant documents for &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your query, or reformulate the query, if it is incorrect in some way.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A response generator agent that can answer your queries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}]},&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user_proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RetrieveUserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}]},&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieve_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_token_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_or_create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initiate_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those new to agent development, AutoGen offers AutoGen Studio, a low-code interface for prototyping agents. While not intended for production use, it significantly lowers the barrier to entry for experimenting with agent architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl4f0daj2zg8tqpcy5rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl4f0daj2zg8tqpcy5rd.png" alt="AutoGen Studio" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that AutoGen is currently undergoing significant updates, with version 0.4.x in development introducing substantial API changes compared to the stable 0.2.x release. While the framework currently has limited built-in persistence and state management capabilities, these features may evolve in future releases.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Swarm
&lt;/h3&gt;

&lt;p&gt;Unlike the other frameworks described in this article, OpenAI Swarm is an educational project, and it's not ready for production use. It's worth mentioning, though, as it's pretty lightweight and easy to get started with. OpenAI Swarm is an experimental framework for orchestrating multi-agent workflows that focuses on agent coordination through direct handoffs rather than complex orchestration patterns.&lt;/p&gt;

&lt;p&gt;With that setup, &lt;strong&gt;agents&lt;/strong&gt; are just exchanging messages in a chat, optionally calling some Python functions to communicate with external services, or handing off the conversation to another agent, if the other one seems to be more suitable to answer the question. Each agent has a certain role, defined by the instructions we provide.&lt;/p&gt;

&lt;p&gt;We have to decide which LLM a particular agent will use, and a set of functions it can call. For example, &lt;strong&gt;a retrieval agent could use a vector database to retrieve documents&lt;/strong&gt;, and return the results to the next agent. That means there should be a function that performs the semantic search on its behalf, but the model will decide what the query should look like.&lt;/p&gt;

&lt;p&gt;Here is how a similar agentic RAG application, implemented in OpenAI Swarm, could look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;swarm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Swarm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Swarm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve documents based on the query.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer_to_query_improve_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query_improve_agent&lt;/span&gt;

&lt;span class="n"&gt;query_improve_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query Improve Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a search expert that takes user queries and improves them to get better results. You fix typos and &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extend queries with synonyms, if needed. You never ask the user for more information.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response_generation_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Generation Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You take the whole conversation and generate a final response based on the chat history. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have enough information, you can retrieve the documents from the knowledge base or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reformulate the query by transferring to other agent. You never ask the user for more information. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have to always be the last participant of each conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;functions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfer_to_query_improve_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_generation_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though we don't explicitly define the graph of processing, the agents can still decide to hand off the processing to a different agent. There is no concept of a state, so everything relies on the messages exchanged between different components. &lt;/p&gt;

&lt;p&gt;OpenAI Swarm does not focus on integration with external tools, and &lt;strong&gt;if you would like to integrate semantic search with Qdrant, you would have to implement it fully yourself&lt;/strong&gt;. The library is also tightly coupled with OpenAI models; using other ones is possible, but it requires additional work, such as setting up a proxy that adjusts the interface to the OpenAI API.&lt;/p&gt;
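
&lt;p&gt;In practice, many inference servers and proxies already expose an OpenAI-compatible REST interface, so it is often enough to point the OpenAI client at a different base URL. Here is a minimal sketch, assuming the &lt;code&gt;swarm&lt;/code&gt; package and a hypothetical local server exposing an OpenAI-compatible API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI
from swarm import Swarm

# Hypothetical OpenAI-compatible endpoint (e.g., a local inference server or proxy)
custom_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local-servers",
)

# Swarm accepts a pre-configured OpenAI client
client = Swarm(client=custom_client)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;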

&lt;h3&gt;
  
  
  The winner?
&lt;/h3&gt;

&lt;p&gt;Choosing the best framework for your agentic RAG system depends on your existing stack, team expertise, and the specific requirements of your project. All the described tools are strong contenders developed at a rapid pace. It's worth keeping an eye on all of them, as they are likely to evolve and improve over time. Eventually, you should be able to build the same processes with any of them, but some may be a better fit for the specific ecosystem of tools you want your agent to interact with.&lt;/p&gt;

&lt;p&gt;There are, however, some important factors to consider when choosing a framework for your agentic RAG system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop&lt;/strong&gt; - even though we aim to build autonomous agents, it's often important to include feedback from a human, so our agents cannot perform harmful actions unchecked. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - how easy it is to debug the system and understand what's happening inside. This is especially important since we are dealing with lots of LLM prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, choosing the right toolkit depends on the state of your project and your specific requirements. If you want to integrate your agent with a number of external tools, CrewAI might be the best choice, as it offers the largest set of out-of-the-box integrations. However, LangGraph integrates well with LangChain, so if you are familiar with that ecosystem, it may suit you better. &lt;/p&gt;

&lt;p&gt;All the frameworks have different approaches to building agents, so it's worth experimenting with all of them to see which one fits your needs the best. LangGraph and CrewAI are more mature and have more features, while AutoGen and OpenAI Swarm are more lightweight and more experimental. However, &lt;strong&gt;none of the existing frameworks solves all the mentioned Information Retrieval problems&lt;/strong&gt;, so you still have to build your own tools to fill the gaps. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building Agentic RAG with Qdrant
&lt;/h2&gt;

&lt;p&gt;No matter which framework you choose, Qdrant is a great tool to build agentic RAG systems. Please check out &lt;a href="https://dev.to/documentation/frameworks/"&gt;our integrations&lt;/a&gt; to choose the best one for your use case and preferences. The easiest way to start using Qdrant is our managed service, &lt;a href="https://cloud.qdrant.io" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt;. A 1GB cluster is available free of charge, so you can start building your agentic RAG system in minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;p&gt;See how Qdrant integrates with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/autogen/" rel="noopener noreferrer"&gt;Autogen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/crewai/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/langgraph/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/swarm/" rel="noopener noreferrer"&gt;Swarm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qdrant</category>
      <category>rag</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What is Vector Quantization?</title>
      <dc:creator>Sabrina</dc:creator>
      <pubDate>Fri, 27 Sep 2024 16:14:21 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-vector-quantization-nna</link>
      <guid>https://dev.to/qdrant/what-is-vector-quantization-nna</guid>
      <description>&lt;p&gt;Vector quantization is a data compression technique used to reduce the size of high-dimensional data. Compressing vectors reduces memory usage while maintaining nearly all of the essential information. This method allows for more efficient storage and faster search operations, particularly in large datasets.&lt;/p&gt;

&lt;p&gt;When working with high-dimensional vectors, such as embeddings from providers like OpenAI, a single 1536-dimensional vector requires &lt;strong&gt;6 KB of memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyngfp8rta3lrolbsq7um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyngfp8rta3lrolbsq7um.png" alt="1536 dimentional vector of 6KB of size" width="742" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With 1 million vectors needing around 6 GB of memory, as your dataset grows to multiple &lt;strong&gt;millions of vectors&lt;/strong&gt;, the memory and processing demands increase significantly.&lt;/p&gt;

&lt;p&gt;To understand why this process is so computationally demanding, let's take a look at the nature of the &lt;a href="https://qdrant.tech/documentation/concepts/indexing/#vector-index" rel="noopener noreferrer"&gt;HNSW index&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;HNSW (Hierarchical Navigable Small World) index&lt;/strong&gt; organizes vectors in a layered graph, connecting each vector to its nearest neighbors. At each layer, the algorithm narrows down the search area until it reaches the lower layers, where it efficiently finds the closest matches to the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxme94wiq18xae15xqup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxme94wiq18xae15xqup.png" alt="HNSW representation" width="723" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each time a new vector is added, the system must determine its position in the existing graph, a process similar to searching. This makes both inserting and searching for vectors complex operations.&lt;/p&gt;

&lt;p&gt;One of the key challenges with the HNSW index is that it requires a lot of &lt;strong&gt;random reads&lt;/strong&gt; and &lt;strong&gt;sequential traversals&lt;/strong&gt; through the graph. This makes the process computationally expensive, especially when you're dealing with millions of high-dimensional vectors.&lt;/p&gt;

&lt;p&gt;The system has to jump between various points in the graph in an unpredictable manner. This unpredictability makes optimization difficult, and as the dataset grows, the memory and processing requirements increase significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvu4zkyo9st9j55s2j59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvu4zkyo9st9j55s2j59.png" alt="HNSW random searches example" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since vectors need to be stored in &lt;strong&gt;fast storage&lt;/strong&gt; like &lt;strong&gt;RAM&lt;/strong&gt; or &lt;strong&gt;SSD&lt;/strong&gt; for low-latency searches, as the size of the data grows, so does the cost of storing and processing it efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization&lt;/strong&gt; offers a solution by compressing vectors to smaller memory sizes, making the process more efficient.&lt;/p&gt;

&lt;p&gt;There are several methods to achieve this, and here we will focus on three main ones:&lt;/p&gt;

&lt;p&gt;&lt;a href="/articles_data/what-is-vector-quantization/types-of-quant.png" class="article-body-image-wrapper"&gt;&lt;img src="/articles_data/what-is-vector-quantization/types-of-quant.png" alt="Types of Quantization: 1. Scalar Quantization, 2. Product Quantization, 3. Binary Quantization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What is Scalar Quantization?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflk1e2v73ih1jcrdssqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflk1e2v73ih1jcrdssqb.png" alt="Astronaut holding cube" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Qdrant, each dimension is represented by a &lt;code&gt;float32&lt;/code&gt; value, which uses &lt;strong&gt;4 bytes&lt;/strong&gt; of memory. When using &lt;a href="https://qdrant.tech/documentation/guides/quantization/#scalar-quantization" rel="noopener noreferrer"&gt;Scalar Quantization&lt;/a&gt;, we map our vectors to a range that the smaller &lt;code&gt;int8&lt;/code&gt; type can represent. An &lt;code&gt;int8&lt;/code&gt; is only &lt;strong&gt;1 byte&lt;/strong&gt; and can represent 256 values (from -128 to 127, or 0 to 255). This results in a &lt;strong&gt;75% reduction&lt;/strong&gt; in memory size.&lt;/p&gt;

&lt;p&gt;For example, if our data lies in the range of -1.0 to 1.0, Scalar Quantization will transform these values to a range that &lt;code&gt;int8&lt;/code&gt; can represent, i.e., within -128 to 127. The system &lt;strong&gt;maps&lt;/strong&gt; the &lt;code&gt;float32&lt;/code&gt; values into this range.&lt;/p&gt;

&lt;p&gt;Here's a simple linear example of what this process looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdax8eqlo5jw8sgam7qli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdax8eqlo5jw8sgam7qli.png" alt="Scalar Quantization example" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To set up Scalar Quantization in Qdrant, you need to include the &lt;code&gt;quantization_config&lt;/code&gt; section when creating or updating a collection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
    "vectors": {
      "size": 128,
      "distance": "Cosine"
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",
            "quantile": 0.99,
            "always_ram": true
        }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
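
&lt;p&gt;For reference, the same configuration can be expressed through the Python client. This is a sketch assuming the &lt;code&gt;qdrant-client&lt;/code&gt; package, a local instance, and a hypothetical collection name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

client.create_collection(
    collection_name="my_collection",  # hypothetical collection name
    vectors_config=models.VectorParams(size=128, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;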


&lt;p&gt;The &lt;code&gt;quantile&lt;/code&gt; parameter is used to calculate the quantization bounds. For example, if you specify a &lt;code&gt;0.99&lt;/code&gt; quantile, the most extreme 1% of values will be excluded from the quantization bounds.&lt;/p&gt;

&lt;p&gt;This parameter only affects the resulting precision, not the memory footprint. You can adjust it if you experience a significant decrease in search quality.&lt;/p&gt;

&lt;p&gt;Scalar Quantization is a great choice if you're looking for strong compression without losing much accuracy. It also slightly improves search speed, as distance calculations (such as dot product or cosine similarity) on &lt;code&gt;int8&lt;/code&gt; values are computationally simpler than on &lt;code&gt;float32&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;While the performance gains of Scalar Quantization may not match those achieved with Binary Quantization (which we'll discuss later), it remains an excellent default choice when Binary Quantization isn’t suitable for your use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. What is Binary Quantization?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkav5g7facj83ggyugpan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkav5g7facj83ggyugpan.png" alt="Astronaut in surreal white environment" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/guides/quantization/#binary-quantization" rel="noopener noreferrer"&gt;Binary Quantization&lt;/a&gt; is an excellent option if you're looking to &lt;strong&gt;reduce memory&lt;/strong&gt; usage while also achieving a significant &lt;strong&gt;boost in speed&lt;/strong&gt;. It works by converting high-dimensional vectors into simple binary (0 or 1) representations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Values greater than zero are converted to 1.&lt;/li&gt;
&lt;li&gt;Values less than or equal to zero are converted to 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's consider our initial example of a 1536-dimensional vector that requires &lt;strong&gt;6 KB&lt;/strong&gt; of memory (4 bytes for each &lt;code&gt;float32&lt;/code&gt; value).&lt;/p&gt;

&lt;p&gt;After Binary Quantization, each dimension is reduced to 1 bit (1/8 byte), so the memory required is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1536 dimensions ÷ 8 bits per byte = 192 bytes&lt;/code&gt;&lt;/p&gt;



&lt;p&gt;This leads to a &lt;strong&gt;32x&lt;/strong&gt; memory reduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxrq191v6pv8mqa21jzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxrq191v6pv8mqa21jzg.png" alt="Binary Quantization example" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant automates the Binary Quantization process during indexing. As vectors are added to your collection, each 32-bit floating-point component is converted into a binary value according to the configuration you define.&lt;/p&gt;

&lt;p&gt;Here’s how you can set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "binary": {
            "always_ram": true
        }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Binary Quantization is by far the quantization method that provides the most significant processing &lt;strong&gt;speed gains&lt;/strong&gt; compared to Scalar and Product Quantizations. This is because the binary representation allows the system to use highly optimized CPU instructions, such as &lt;a href="https://en.wikipedia.org/wiki/XOR_gate#:~:text=XOR%20represents%20the%20inequality%20function,the%20other%20but%20not%20both%22" rel="noopener noreferrer"&gt;XOR&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Hamming_weight" rel="noopener noreferrer"&gt;Popcount&lt;/a&gt;, for fast distance computations.&lt;/p&gt;

&lt;p&gt;It can speed up search operations by &lt;strong&gt;up to 40x&lt;/strong&gt;, depending on the dataset and hardware.&lt;/p&gt;
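
&lt;p&gt;To see why this is so fast, note that comparing two binary vectors reduces to an XOR followed by a popcount. A small NumPy illustration of the idea (Qdrant does this natively with optimized CPU instructions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(7)
a = np.packbits(rng.normal(size=1536) &gt; 0)  # two binary-quantized vectors
b = np.packbits(rng.normal(size=1536) &gt; 0)

# XOR marks the dimensions where the two vectors differ;
# counting the set bits (popcount) gives the Hamming distance.
hamming = int(np.unpackbits(np.bitwise_xor(a, b)).sum())
print(hamming)  # number of differing bits out of 1536
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;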

&lt;p&gt;Not all models are equally compatible with Binary Quantization: some may experience a greater loss in accuracy when quantized. We recommend using Binary Quantization with models that have &lt;strong&gt;at least 1024 dimensions&lt;/strong&gt; to minimize accuracy loss.&lt;/p&gt;

&lt;p&gt;The models that have shown the best compatibility with this method include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI text-embedding-ada-002&lt;/strong&gt; (1536 dimensions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cohere AI embed-english-v2.0&lt;/strong&gt; (4096 dimensions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models demonstrate minimal accuracy loss while still benefiting from substantial speed and memory gains.&lt;/p&gt;

&lt;p&gt;Even though Binary Quantization is incredibly fast and memory-efficient, the trade-offs are in &lt;strong&gt;precision&lt;/strong&gt; and &lt;strong&gt;model compatibility&lt;/strong&gt;, so you may need to ensure search quality using techniques like oversampling and rescoring.&lt;/p&gt;

&lt;p&gt;If you're interested in exploring Binary Quantization in more detail—including implementation examples, benchmark results, and usage recommendations—check out our dedicated article on &lt;a href="https://qdrant.tech/articles/binary-quantization/" rel="noopener noreferrer"&gt;Binary Quantization - Vector Search, 40x Faster&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. What is Product Quantization?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdefp3dfh6s565lpls7wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdefp3dfh6s565lpls7wx.png" alt="Astronaut with centroids" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/guides/quantization/#product-quantization" rel="noopener noreferrer"&gt;Product Quantization&lt;/a&gt; is a method used to compress high-dimensional vectors by representing them with a smaller set of representative points.&lt;/p&gt;

&lt;p&gt;The process begins by splitting the original high-dimensional vectors into smaller &lt;strong&gt;sub-vectors.&lt;/strong&gt; Each sub-vector represents a segment of the original vector, capturing different characteristics of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjy9jtvmlsn8pqr93xtwp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjy9jtvmlsn8pqr93xtwp.png" alt="Creation of the Sub-vector" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each sub-vector, a separate &lt;strong&gt;codebook&lt;/strong&gt; is created, representing regions in the data space where common patterns occur.&lt;/p&gt;

&lt;p&gt;The codebook in Qdrant is trained automatically during the indexing process. As vectors are added to the collection, Qdrant uses your specified quantization settings in the &lt;code&gt;quantization_config&lt;/code&gt; to build the codebook and quantize the vectors. Here’s how you can set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1024,
      "distance": "Cosine"
    },
    "quantization_config": {
        "product": {
            "compression": "x32",
            "always_ram": true
        }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each region in the codebook is defined by a &lt;strong&gt;centroid&lt;/strong&gt;, which serves as a representative point summarizing the characteristics of that region. Instead of treating every single data point as equally important, we can group similar sub-vectors together and represent them with a single centroid that captures the general characteristics of that group.&lt;/p&gt;

&lt;p&gt;The centroids used in Product Quantization are determined using the &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/K-means_clustering" rel="noopener noreferrer"&gt;K-means clustering algorithm&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sgobztmeyzsv31kd2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sgobztmeyzsv31kd2g.png" alt="Codebook and Centroids example" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant always selects &lt;strong&gt;K = 256&lt;/strong&gt; as the number of centroids in its implementation, based on the fact that 256 is the maximum number of unique values that can be represented by a single byte.&lt;/p&gt;

&lt;p&gt;This makes the compression process efficient because each centroid index can be stored in a single byte.&lt;/p&gt;

&lt;p&gt;The original high-dimensional vectors are quantized by mapping each sub-vector to the nearest centroid in its respective codebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zjbzxr5tr7xwnge3xu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zjbzxr5tr7xwnge3xu.png" alt="Vectors being mapped to their corresponding centroids example" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The compressed vector stores the index of the closest centroid for each sub-vector.&lt;/p&gt;

&lt;p&gt;Here’s how a 1024-dimensional vector, originally taking up 4096 bytes, is reduced to just 128 bytes by representing it as 128 indexes, each pointing to the centroid of a sub-vector:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp37escdizqs8srip6pr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp37escdizqs8srip6pr2.png" alt="Product Quantization example" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up quantization and adding your vectors, you can perform searches as usual. Qdrant will automatically use the quantized vectors, optimizing both speed and memory usage. Optionally, you can enable rescoring for better accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/{collection_name}/points/search
{
    "query": [0.22, -0.01, -0.98, 0.37],
    "params": {
        "quantization": {
            "rescore": true
        }
    },
    "limit": 10
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Product Quantization can significantly reduce memory usage, potentially offering up to &lt;strong&gt;64x&lt;/strong&gt; compression in certain configurations. However, it's important to note that this level of compression can lead to a noticeable drop in quality.&lt;/p&gt;

&lt;p&gt;If your application requires high precision or real-time performance, Product Quantization may not be the best choice. However, if &lt;strong&gt;memory savings&lt;/strong&gt; are critical and some accuracy loss is acceptable, it could still be an ideal solution.&lt;/p&gt;

&lt;p&gt;Here’s a comparison of speed, accuracy, and compression for all three methods, adapted from &lt;a href="https://qdrant.tech/documentation/guides/quantization/#how-to-choose-the-right-quantization-method" rel="noopener noreferrer"&gt;Qdrant's documentation&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization method&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scalar&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;up to x2&lt;/td&gt;
&lt;td&gt;x4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;x0.5&lt;/td&gt;
&lt;td&gt;up to x64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary&lt;/td&gt;
&lt;td&gt;0.95*&lt;/td&gt;
&lt;td&gt;up to x40&lt;/td&gt;
&lt;td&gt;x32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* - for compatible models&lt;/p&gt;

&lt;p&gt;For a more in-depth understanding of the benchmarks you can expect, check out our dedicated article on &lt;a href="https://qdrant.tech/articles/product-quantization/" rel="noopener noreferrer"&gt;Product Quantization in Vector Search&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rescoring, Oversampling, and Reranking
&lt;/h2&gt;

&lt;p&gt;When we use quantization methods like Scalar, Binary, or Product Quantization, we're compressing our vectors to save memory and improve performance. However, this compression removes some detail from the original vectors.&lt;/p&gt;

&lt;p&gt;This can slightly reduce the accuracy of our similarity searches because the quantized vectors are approximations of the original data. To mitigate this loss of accuracy, you can use &lt;strong&gt;oversampling&lt;/strong&gt; and &lt;strong&gt;rescoring&lt;/strong&gt;, which help improve the accuracy of the final search results.&lt;/p&gt;

&lt;p&gt;The original vectors are never deleted during this process, and you can easily switch between quantization methods or parameters by updating the collection configuration at any time.&lt;/p&gt;

&lt;p&gt;Here’s how the process works, step by step:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Initial Quantized Search
&lt;/h3&gt;

&lt;p&gt;When you perform a search, Qdrant retrieves the top candidates using the quantized vectors, ranked by their similarity to the query vector. This step is fast because the quantized vectors are much smaller and cheaper to compare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzu1flsaiy7pmwwqgyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzu1flsaiy7pmwwqgyt.png" alt="ANN Search with Quantization" width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Oversampling
&lt;/h3&gt;

&lt;p&gt;Oversampling is a technique that helps compensate for any precision lost due to quantization. Since quantization simplifies vectors, some relevant matches could be missed in the initial search. To avoid this, you can &lt;strong&gt;retrieve more candidates&lt;/strong&gt;, increasing the chances that the most relevant vectors make it into the final results.&lt;/p&gt;

&lt;p&gt;You can control the number of extra candidates by setting an &lt;code&gt;oversampling&lt;/code&gt; parameter. For example, if your desired number of results (&lt;code&gt;limit&lt;/code&gt;) is 4 and you set an &lt;code&gt;oversampling&lt;/code&gt; factor of 2, Qdrant will retrieve 8 candidates (4 × 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv52syp3swcfq5m2f9v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv52syp3swcfq5m2f9v7.png" alt="ANN Search with Quantization and Oversampling" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can adjust the oversampling factor to control how many extra vectors Qdrant includes in the initial pool. More candidates mean a better chance of obtaining high-quality top-K results, especially after rescoring with the original vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rescoring with Original Vectors
&lt;/h3&gt;

&lt;p&gt;After oversampling to gather more potential matches, each candidate is re-evaluated to ensure higher accuracy and relevance to the query.&lt;/p&gt;

&lt;p&gt;The rescoring process &lt;strong&gt;maps&lt;/strong&gt; each quantized candidate back to its original vector and recomputes the similarity score at full precision, recovering detail that was lost during quantization and leading to more accurate results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xofzja0fqb4xutgqqaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xofzja0fqb4xutgqqaz.png" alt="Rescoring with Original Vectors" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During rescoring, one of the lower-ranked candidates from oversampling might turn out to be a better match than some of the original top-K candidates.&lt;/p&gt;

&lt;p&gt;Even though rescoring uses the original, larger vectors, the process remains much faster because only a very small number of vectors are read. The initial quantized search already identifies the specific vectors to read, rescore, and rerank.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Reranking
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reranking&lt;/strong&gt; is where the final top-K candidates are determined based on the updated similarity scores from rescoring.&lt;/p&gt;

&lt;p&gt;For example, in our case with a limit of 4, a candidate that ranked 6th in the initial quantized search might improve its score after rescoring because the original vectors capture detail that quantization discarded. As a result, this candidate could move into the final top 4 after reranking, replacing a less relevant option from the initial search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff45cru5len2fprc22rx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff45cru5len2fprc22rx9.png" alt="Reranking with Original Vectors" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how you can set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/{collection_name}/points/search
{
  "query": [0.22, -0.01, -0.98, 0.37],
  "params": {
    "quantization": {
      "rescore": true,
      "oversampling": 2
    }
  },
  "limit": 4
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
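
&lt;p&gt;The equivalent request through the Python client might look like this. A sketch assuming the &lt;code&gt;qdrant-client&lt;/code&gt; package, a local instance, and a hypothetical collection name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

results = client.query_points(
    collection_name="my_collection",  # hypothetical collection name
    query=[0.22, -0.01, -0.98, 0.37],
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,      # re-evaluate candidates with the original vectors
            oversampling=2.0,  # fetch limit x 2 candidates before rescoring
        )
    ),
    limit=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;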



&lt;p&gt;You can adjust the &lt;code&gt;oversampling&lt;/code&gt; factor to find the right balance between search speed and result accuracy.&lt;/p&gt;

&lt;p&gt;If quantization is hurting result quality in an application that requires high accuracy, combining oversampling with rescoring is a great choice. However, if you need faster searches and can tolerate some loss in accuracy, you might use oversampling without rescoring, or lower the oversampling factor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributing Resources Between Disk &amp;amp; Memory
&lt;/h2&gt;

&lt;p&gt;Qdrant stores both the quantized and the original vectors. When you enable quantization, both are kept in RAM by default. You can move the original vectors to disk to significantly reduce RAM usage and lower system costs. Simply enabling quantization is not enough: you need to explicitly move the original vectors to disk by setting &lt;code&gt;on_disk=True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s an example configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine",
    "on_disk": true  # Move original vectors to disk
  },
  "quantization_config": {
    "binary": {
      "always_ram": true  # Store only quantized vectors in RAM
    }
  }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
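
&lt;p&gt;Through the Python client, the same split might be configured like this. A sketch assuming &lt;code&gt;qdrant-client&lt;/code&gt; and the same hypothetical names as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

client.create_collection(
    collection_name="my_collection",  # hypothetical collection name
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
        on_disk=True,  # original vectors live on disk
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),  # quantized vectors stay in RAM
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;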



&lt;p&gt;Without explicitly setting &lt;code&gt;on_disk=True&lt;/code&gt;, you won't see any RAM savings, even with quantization enabled. So, make sure to configure both storage and quantization options based on your memory and performance needs. If your storage has high disk latency, consider disabling rescoring to maintain speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speeding Up Rescoring with io_uring
&lt;/h3&gt;

&lt;p&gt;When dealing with large collections of quantized vectors, rescoring requires frequent disk reads to retrieve the original data. While &lt;code&gt;mmap&lt;/code&gt; helps with efficient I/O by reducing user-to-kernel transitions, these reads can still slow rescoring down when working with large datasets on disk.&lt;/p&gt;

&lt;p&gt;On Linux-based systems, &lt;code&gt;io_uring&lt;/code&gt; allows multiple disk operations to be processed in parallel, significantly reducing I/O overhead. This optimization is particularly effective during rescoring, where multiple vectors need to be re-evaluated after the initial search. With io_uring, Qdrant can retrieve and rescore vectors from disk in the most efficient way, improving overall search performance.&lt;/p&gt;

&lt;p&gt;When you perform vector quantization and store data on disk, Qdrant often needs to access multiple vectors in parallel. Without io_uring, this process can be slowed down due to the system’s limitations in handling many disk accesses.&lt;/p&gt;

&lt;p&gt;To enable &lt;code&gt;io_uring&lt;/code&gt; in Qdrant, add the following to your storage configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;async_scorer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Enable io_uring for async storage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this configuration, Qdrant will default to using &lt;code&gt;mmap&lt;/code&gt; for disk I/O operations.&lt;/p&gt;

&lt;p&gt;For more information and benchmarks comparing io_uring with traditional I/O approaches like mmap, check out &lt;a href="https://qdrant.tech/articles/io_uring/" rel="noopener noreferrer"&gt;Qdrant's io_uring implementation article.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance of Quantized vs. Non-Quantized Data
&lt;/h2&gt;

&lt;p&gt;Qdrant uses the quantized vectors by default if they are available. If you want to evaluate how quantization affects your search results, you can temporarily disable it to compare results from quantized and non-quantized searches. To do this, set &lt;code&gt;ignore: true&lt;/code&gt; in the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/{collection_name}/points/query
{
    "query": [0.22, -0.01, -0.98, 0.37],
    "params": {
        "quantization": {
            "ignore": true,
        }
    },
    "limit": 4
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
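
&lt;p&gt;In the Python client, the same flag is &lt;code&gt;QuantizationSearchParams(ignore=True)&lt;/code&gt;. A sketch under the same assumptions as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

results = client.query_points(
    collection_name="my_collection",  # hypothetical collection name
    query=[0.22, -0.01, -0.98, 0.37],
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(ignore=True)  # bypass quantized vectors
    ),
    limit=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;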



&lt;h3&gt;
  
  
  Switching Between Quantization Methods
&lt;/h3&gt;

&lt;p&gt;Not sure if you’ve chosen the right quantization method? In Qdrant, you have the flexibility to remove quantization and rely solely on the original vectors, adjust the quantization type, or change compression parameters at any time without affecting your original vectors.&lt;/p&gt;

&lt;p&gt;To switch to Binary Quantization, for example, you can update the collection’s quantization configuration (the &lt;code&gt;update_collection&lt;/code&gt; method in the Python client). Note that Binary Quantization has a fixed compression ratio, so there is no compression rate to tune:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "quantization_config": {
    "binary": {
      "always_ram": true,
      "compression_rate": 0.8  # Set the new compression rate
    }
  }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you decide to &lt;strong&gt;turn off quantization&lt;/strong&gt; and use only the original vectors, you can remove the quantization settings entirely with &lt;code&gt;quantization_config=None&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/my_collection
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "quantization_config": null  # Remove quantization and use original vectors only
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b9hbwu41bmnt0141tu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b9hbwu41bmnt0141tu1.png" alt="Astronaut leaving" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quantization methods like Scalar, Product, and Binary Quantization offer powerful ways to optimize memory usage and improve search performance when dealing with large datasets of high-dimensional vectors. Each method comes with its own trade-offs between memory savings, computational speed, and accuracy.&lt;/p&gt;

&lt;p&gt;Here are some final thoughts to help you choose the right quantization method for your needs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Quantization Method&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;When to Use&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Binary Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;• &lt;strong&gt;Fastest method and most memory-efficient&lt;/strong&gt;&lt;br&gt;•  Up to &lt;strong&gt;40x&lt;/strong&gt; faster search and &lt;strong&gt;32x&lt;/strong&gt; reduced memory footprint&lt;/td&gt;
&lt;td&gt;• Use with tested models like OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt; and Cohere's &lt;code&gt;embed-english-v2.0&lt;/code&gt;&lt;br&gt;• When speed and memory efficiency are critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalar Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;• &lt;strong&gt;Minimal loss of accuracy&lt;/strong&gt;&lt;br&gt;•  Up to &lt;strong&gt;4x&lt;/strong&gt; reduced memory footprint&lt;/td&gt;
&lt;td&gt;• Safe default choice for most applications.&lt;br&gt;• Offers a good balance between accuracy, speed, and compression.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;• &lt;strong&gt;Highest compression ratio&lt;/strong&gt;&lt;br&gt;• Up to &lt;strong&gt;64x&lt;/strong&gt; reduced memory footprint&lt;/td&gt;
&lt;td&gt;• When minimizing memory usage is the top priority&lt;br&gt;• Acceptable if some loss of accuracy and slower indexing is tolerable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Learn More
&lt;/h3&gt;

&lt;p&gt;If you want to learn more about improving accuracy, memory efficiency, and speed when using quantization in Qdrant, we have a dedicated &lt;a href="https://qdrant.tech/documentation/guides/quantization/#quantization-tips" rel="noopener noreferrer"&gt;Quantization tips&lt;/a&gt; section in our docs that covers the techniques you can use to enhance your results.&lt;/p&gt;

&lt;p&gt;Learn more about optimizing real-time precision with oversampling in Binary Quantization by watching this interview with Qdrant’s CTO, Andrey Vasnetsov:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/4aUq5VnR_VI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Stay up-to-date on the latest in &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt; and quantization, share your projects, ask questions, and &lt;a href="https://discord.com/invite/qdrant" rel="noopener noreferrer"&gt;join our vector search community&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Complete Guide to Filtering in Vector Search</title>
      <dc:creator>Sabrina</dc:creator>
      <pubDate>Thu, 12 Sep 2024 14:45:10 +0000</pubDate>
      <link>https://dev.to/qdrant/a-complete-guide-to-filtering-in-vector-search-33lk</link>
      <guid>https://dev.to/qdrant/a-complete-guide-to-filtering-in-vector-search-33lk</guid>
      <description>&lt;p&gt;Imagine you sell computer hardware. To help shoppers easily find products on your website, you need to have a &lt;strong&gt;user-friendly &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;search engine&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0vwfd8cpy0698f5mhq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0vwfd8cpy0698f5mhq6.png" alt="vector-search-ecommerce" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re selling computers and have extensive data on laptops, desktops, and accessories, your search feature should guide customers to the exact device they want - or a &lt;strong&gt;very similar&lt;/strong&gt; match if the exact one isn't available.&lt;/p&gt;

&lt;p&gt;When storing data in Qdrant, each product is a point, consisting of an &lt;code&gt;id&lt;/code&gt;, a &lt;code&gt;vector&lt;/code&gt; and &lt;code&gt;payload&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"vector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;899.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; is a unique identifier for the point in your collection. The &lt;code&gt;vector&lt;/code&gt; is a numerical representation of the item, used to measure its similarity to other points in the collection. &lt;br&gt;
Finally, the &lt;code&gt;payload&lt;/code&gt; holds metadata that directly describes the point.&lt;/p&gt;

&lt;p&gt;Though we may not be able to decipher the vector, we are able to derive additional information about the item from its metadata. In this specific case, &lt;strong&gt;we are looking at a data point for a laptop that costs $899.99&lt;/strong&gt;. &lt;/p&gt;
&lt;h2&gt;
  
  
  What is filtering?
&lt;/h2&gt;

&lt;p&gt;When searching for the perfect computer, your customers may end up with results that are mathematically similar to the search entry, but not exact. For example, if they are searching for &lt;strong&gt;laptops under $1000&lt;/strong&gt;, a simple &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt; without constraints might still show other laptops over $1000. &lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://dev.to/advanced-search/"&gt;semantic search&lt;/a&gt; alone &lt;strong&gt;may not be enough&lt;/strong&gt;. In order to get the exact result, you would need to enforce a payload filter on the &lt;code&gt;price&lt;/code&gt;. Only then can you be sure that the search results abide by the chosen characteristic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is called &lt;strong&gt;filtering&lt;/strong&gt; and it is one of the key features of &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt;. &lt;br&gt;
Here is how a &lt;strong&gt;filtered vector search&lt;/strong&gt; looks behind the scenes. We'll cover its mechanics in the following section.&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/online_store/points/search
{
  "vector": [ 0.2, 0.1, 0.9, 0.7 ],
  "filter": {
    "must": [
      {
        "key": "category",
        "match": { "value": "laptop" }
      },
      {
        "key": "price",
        "range": {
          "gt": null,
          "gte": null,
          "lt": null,
          "lte": 1000
        }
      }
    ]
  },
  "limit": 3,
  "with_payload": true,
  "with_vector": false
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The filtered result will be a combination of the semantic search and the filtering conditions imposed upon the query. In the following pages, we will show that &lt;strong&gt;filtering is a key practice in vector search for two reasons:&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;With filtering, you can &lt;strong&gt;dramatically increase search precision&lt;/strong&gt;. More on this in the next section.&lt;/li&gt;
&lt;li&gt;Filtering helps control resources and &lt;strong&gt;reduce compute use&lt;/strong&gt;. More on this in &lt;strong&gt;Payload Indexing&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What you will learn in this guide:
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt;, filtering and sorting are more interdependent than they are in traditional databases. While SQL databases use commands such as &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt;, the interplay between these processes in vector search is a bit more complex.&lt;/p&gt;

&lt;p&gt;Most people use default settings and build vector search apps that aren't properly configured or even set up for precise retrieval. In this guide, we will show you how to &lt;strong&gt;use filtering to get the most out of vector search&lt;/strong&gt; with some basic and advanced strategies that are easy to implement. &lt;/p&gt;

&lt;h3&gt;
  
  
  Remember to run all tutorial code in Qdrant's Dashboard
&lt;/h3&gt;

&lt;p&gt;The easiest way to reach that "Hello World" moment is to &lt;a href="https://dev.to/documentation/quickstart-cloud/"&gt;&lt;strong&gt;try filtering in a live cluster&lt;/strong&gt;&lt;/a&gt;. Our interactive tutorial will show you how to create a cluster, add data and try some filtering clauses. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyvsxdq1qtak3kxt45y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyvsxdq1qtak3kxt45y6.png" alt="qdrant-filtering-tutorial" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Qdrant's approach to filtering
&lt;/h2&gt;

&lt;p&gt;Qdrant follows a specific method of searching and filtering through dense vectors. &lt;/p&gt;

&lt;p&gt;Let's take a look at this &lt;strong&gt;3-stage diagram&lt;/strong&gt;. In this case, we are trying to find the nearest neighbor to the query vector &lt;strong&gt;(green)&lt;/strong&gt;. Your search journey starts at the bottom &lt;strong&gt;(orange)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By default, Qdrant connects all your data points within the &lt;a href="https://dev.to/documentation/concepts/indexing/"&gt;&lt;strong&gt;vector index&lt;/strong&gt;&lt;/a&gt;. After you &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;&lt;strong&gt;introduce filters&lt;/strong&gt;&lt;/a&gt;, some data points become disconnected. Vector search can't cross the grayed out area and it won't reach the nearest neighbor. &lt;br&gt;
How can we bridge this gap? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; How Qdrant maintains a filterable vector index. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw5ev6l3u7bzp8861sfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw5ev6l3u7bzp8861sfp.png" alt="filterable-vector-index" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/"&gt;&lt;strong&gt;Filterable vector index&lt;/strong&gt;&lt;/a&gt;: This technique builds additional links &lt;strong&gt;(orange)&lt;/strong&gt; between leftover data points. The filtered points which stay behind are now traversible once again. Qdrant uses special category-based methods to connect these data points. &lt;/p&gt;
&lt;h2&gt;
  
  
  Qdrant's approach vs traditional filtering methods
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tgwitznyx39lxze4ad1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tgwitznyx39lxze4ad1.png" alt="stepping-lens" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The filterable vector index is Qdrant's solution to the pre- and post-filtering problems: it adds specialized links to the search graph. It aims to maintain the speed advantages of vector search while allowing for precise filtering, addressing the inefficiencies that can occur when applying filters after the vector search.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pre-filtering
&lt;/h3&gt;

&lt;p&gt;In pre-filtering, a search engine first narrows down the dataset based on chosen metadata values, and then searches within that filtered subset. This reduces unnecessary computation over a dataset that is potentially much larger.&lt;/p&gt;

&lt;p&gt;The choice between pre-filtering and using the filterable HNSW index depends on filter cardinality. When metadata cardinality is too low, the filter becomes very restrictive and can disrupt the connections within the graph. This leads to fragmented search paths (as in &lt;strong&gt;Figure 1&lt;/strong&gt;), and when the semantic search process begins, it won’t be able to reach the disconnected regions. &lt;/p&gt;

&lt;p&gt;However, Qdrant still benefits from pre-filtering &lt;strong&gt;under certain conditions&lt;/strong&gt;. In cases of low cardinality, Qdrant's query planner stops using HNSW and switches over to the payload index alone. This makes the search process much cheaper and faster than if using HNSW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; On the user side, this is how filtering looks. We start with five products with different prices. First, the $1000 price &lt;strong&gt;filter&lt;/strong&gt; is applied, narrowing down the selection of laptops. Then, a vector search finds the relevant &lt;strong&gt;results&lt;/strong&gt; within this filtered set. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxomreb49ere8ow2n5d6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxomreb49ere8ow2n5d6g.png" alt="pre-filtering-vector-search" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In conclusion, pre-filtering is efficient in specific cases: small datasets with low-cardinality metadata. However, pre-filtering should not be used over large datasets, as it breaks too many links in the HNSW graph and lowers accuracy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Post-filtering
&lt;/h3&gt;

&lt;p&gt;In post-filtering, a search engine first looks for similar vectors and retrieves a larger set of results. Then, it applies filters to those results based on metadata. The problem with post-filtering becomes apparent when using low-cardinality filters. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When you apply a low-cardinality filter after performing a vector search, you often end up discarding a large portion of the results that the vector search returned.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Figure 3:&lt;/strong&gt; In the same example, we have five laptops. First, the vector search finds the top two relevant &lt;strong&gt;results&lt;/strong&gt;, but they may not match the price filter. When the $1000 price &lt;strong&gt;filter&lt;/strong&gt; is applied, other potential results are discarded.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqofpcv9w7q0exrq0cuot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqofpcv9w7q0exrq0cuot.png" alt="post-filtering-vector-search" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system will waste computational resources by first finding similar vectors and then discarding many that don't meet the filter criteria. You're also limited to filtering only from the initial set of &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt; results. If your desired items aren't in this initial set, you won't find them, even if they exist in the database.&lt;/p&gt;
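
&lt;p&gt;To make the failure mode concrete, here is what post-filtering amounts to if done by hand on the client side. This is a sketch, not a recommendation; &lt;code&gt;query_points&lt;/code&gt; assumes a recent qdrant-client version, and the collection name and query vector are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Post-filtering by hand: search first, filter afterwards.
hits = client.query_points(
    collection_name="online_store",
    query=[0.1, 0.2, 0.3, 0.4],
    limit=2,  # only the top 2 results are ever seen
).points

# Discard anything over budget. If neither of the top 2 hits is cheap
# enough, we return nothing, even though cheaper laptops exist.
affordable = [hit for hit in hits if hit.payload["price"] &amp;lt;= 1000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;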
&lt;h2&gt;
  
  
  Basic filtering example: ecommerce and laptops
&lt;/h2&gt;

&lt;p&gt;We know that there are three possible laptops that suit our price point. &lt;br&gt;
Let's see how Qdrant's filterable vector index works and why it is the best method of capturing all available results.  &lt;/p&gt;

&lt;p&gt;First, add five new laptops to your online store. Here is a sample input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;laptops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;899.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1299.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;799.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1099.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;949.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four-dimensional vector can represent features like laptop CPU, RAM or battery life, but that isn’t specified. The payload, however, specifies the exact price and product category.&lt;/p&gt;
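
&lt;p&gt;To follow along, you could load these points with the Python client. This is a minimal sketch; the collection name &lt;code&gt;online_store&lt;/code&gt;, the vector size, and the distance metric are assumptions of this example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# A collection for 4-dimensional vectors, matching the sample data above.
client.create_collection(
    collection_name="online_store",
    vectors_config=models.VectorParams(size=4, distance=models.Distance.COSINE),
)

# Each (id, vector, payload) tuple from the laptops list becomes one point.
client.upsert(
    collection_name="online_store",
    points=[
        models.PointStruct(id=point_id, vector=vector, payload=payload)
        for point_id, vector, payload in laptops
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;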

&lt;p&gt;Now, set the filter to "price is less than $1000":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
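
&lt;p&gt;Attached to a vector query with the Python client, the same filter looks roughly like this (a sketch: the query vector is arbitrary, and &lt;code&gt;query_points&lt;/code&gt; assumes a recent qdrant-client version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

results = client.query_points(
    collection_name="online_store",
    query=[0.2, 0.3, 0.4, 0.5],  # an arbitrary query vector
    query_filter=models.Filter(
        must=[
            # Only points with a price of $1000 or less are traversed.
            models.FieldCondition(key="price", range=models.Range(lte=1000)),
        ]
    ),
    limit=3,
).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;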



&lt;p&gt;When the filter for prices of $1000 or less is applied, the vector search returns the following results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9978443564622781&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;799.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9938079894227599&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;899.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9903751498208603&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;949.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, Qdrant's filterable index captures all three laptops under $1000, rather than only those that happened to rank in an unfiltered top-k. &lt;/p&gt;

&lt;p&gt;This specific example uses the &lt;code&gt;range&lt;/code&gt; condition for filtering. Qdrant, however, offers many other possible ways to structure a filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For detailed usage examples, &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;filtering&lt;/a&gt; docs are the best resource.&lt;/strong&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Scrolling instead of searching
&lt;/h2&gt;

&lt;p&gt;You don't need to use our &lt;code&gt;search&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt; APIs to filter through data. The &lt;code&gt;scroll&lt;/code&gt; API is another option that lets you retrieve lists of points matching a filter.&lt;/p&gt;

&lt;p&gt;If you aren't interested in finding similar points, you can simply list the ones that match a given filter. While search gives you the points most similar to a query vector, scroll gives you all points matching your filter, with no similarity ranking involved. &lt;/p&gt;

&lt;p&gt;In Qdrant, scrolling is used to iteratively &lt;strong&gt;retrieve large sets of points from a collection&lt;/strong&gt;. It is particularly useful when you’re dealing with a large number of points and don’t want to load them all at once. Instead, Qdrant provides a way to scroll through the points &lt;strong&gt;one page at a time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You start by sending a scroll request to Qdrant with optional conditions, such as a payload filter, a page size, and an ordering key.&lt;/p&gt;

&lt;p&gt;Let's retrieve the top 10 laptops in the store, ordered by price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/online_store/points/scroll
{
    "filter": {
        "must": [
            {
                "key": "category",
                "match": {
                    "value": "laptop"
                }
            }
        ]
    },
    "limit": 10,
    "with_payload": true,
    "with_vector": false,
    "order_by": [
        {
            "key": "price",
        }
    ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
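
&lt;p&gt;The equivalent call with the Python client might look like this (a sketch; &lt;code&gt;order_by&lt;/code&gt; requires a payload index on &lt;code&gt;price&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

points, next_offset = client.scroll(
    collection_name="online_store",
    scroll_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="category",
                match=models.MatchValue(value="laptop"),
            )
        ]
    ),
    limit=10,
    with_payload=True,
    with_vectors=False,
    order_by=models.OrderBy(key="price"),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;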



&lt;p&gt;The response contains a batch of points that match the criteria and a reference (offset or next page token) to retrieve the next set of points.&lt;/p&gt;
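
&lt;p&gt;In Python, that pagination loop might look like the following sketch: keep passing the returned offset back in until it comes back as &lt;code&gt;None&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

all_laptops = []
offset = None

while True:
    page, offset = client.scroll(
        collection_name="online_store",
        scroll_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="category",
                    match=models.MatchValue(value="laptop"),
                )
            ]
        ),
        limit=100,      # page size
        offset=offset,  # resume where the previous page ended
        with_payload=True,
    )
    all_laptops.extend(page)
    if offset is None:  # no next page token means we are done
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;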

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://dev.to/documentation/concepts/points/#scroll-points"&gt;&lt;strong&gt;Scrolling&lt;/strong&gt;&lt;/a&gt; is designed to be efficient. It minimizes the load on the server and reduces memory consumption on the client side by returning only manageable chunks of data at a time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Available filtering conditions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Condition&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Condition&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exact value match.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Range&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by value range.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Match Any&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match multiple values.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Datetime Range&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by date range.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Match Except&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exclude specific values.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;UUID Match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match UUID payload values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nested Key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by nested data.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Geo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by location.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nested Object&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by nested objects.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Values Count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by element count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Text Match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search in text fields.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Is Empty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter empty fields.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Has ID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by point ID.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Is Null&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter null values.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
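
&lt;p&gt;As a small illustration, here is a filter combining two of these conditions with the Python client. The field names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

conditions_filter = models.Filter(
    must=[
        # Match Any: the brand is one of several accepted values.
        models.FieldCondition(
            key="brand",
            match=models.MatchAny(any=["apple", "lenovo"]),
        ),
        # Range: price between $500 and $1000, inclusive.
        models.FieldCondition(
            key="price",
            range=models.Range(gte=500, lte=1000),
        ),
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;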

&lt;blockquote&gt;
&lt;p&gt;All clauses and conditions are outlined in Qdrant's &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;filtering&lt;/a&gt; documentation. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Filtering clauses to remember
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Clause&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Clause&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Must&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Includes items that meet the condition  (similar to &lt;code&gt;AND&lt;/code&gt;).&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Should&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filters if at least one condition is met  (similar to &lt;code&gt;OR&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Must Not&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excludes items that meet the condition  (similar to &lt;code&gt;NOT&lt;/code&gt;).&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Clauses Combination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combines multiple clauses to refine filtering  (similar to &lt;code&gt;AND&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
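
&lt;p&gt;And here is a sketch of the clauses working together in a single filter, again with hypothetical payload fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

# must: a laptop priced at $1000 or less (both conditions required);
# should: at least one of the preferred brands;
# must_not: refurbished items are excluded.
combined_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="category", match=models.MatchValue(value="laptop")
        ),
        models.FieldCondition(key="price", range=models.Range(lte=1000)),
    ],
    should=[
        models.FieldCondition(key="brand", match=models.MatchValue(value="apple")),
        models.FieldCondition(key="brand", match=models.MatchValue(value="lenovo")),
    ],
    must_not=[
        models.FieldCondition(
            key="condition", match=models.MatchValue(value="refurbished")
        ),
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;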

&lt;h2&gt;
  
  
  Advanced filtering example: dinosaur diets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wzc3g0uqu3rylr714u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wzc3g0uqu3rylr714u.png" alt="advanced-payload-filtering" width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also use nested filtering to query arrays of objects within the payload. In this example, we have two points. Each represents a dinosaur with a list of food preferences (diet) indicating which foods it likes or dislikes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dinosaur"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t-rex"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"diet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"leaves"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dinosaur"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"diplodocus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"diet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"leaves"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose we want the dinosaurs that actually like meat, i.e., food = meat and likes = true within the same diet item. A first attempt is to filter on the array fields directly, using the &lt;code&gt;diet[].food&lt;/code&gt; and &lt;code&gt;diet[].likes&lt;/code&gt; keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/dinosaurs/points/scroll
{
    "filter": {
        "must": [
            {
                "key": "diet[].food",
                  "match": {
                    "value": "meat"
                }
            },
            {
                "key": "diet[].likes",
                  "match": {
                    "value": true
                }
            }
        ]
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dinosaurs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scroll_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet[].food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet[].likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dinosaurs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diet[].food&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;meat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diet[].likes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="nf"&gt;.scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.List&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.QdrantClient&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.QdrantGrpcClient&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.Filter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;QdrantClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;QdrantClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;QdrantGrpcClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6334&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrollAsync&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCollectionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAllMust&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].food"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].likes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Qdrant.Client&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conditions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;6334&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ScrollAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;MatchKeyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query, however, returns both points, because each condition is checked independently against any element of the array:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the "t-rex" matches food=meat on &lt;code&gt;diet[1].food&lt;/code&gt; and likes=true on &lt;code&gt;diet[1].likes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;the "diplodocus" matches food=meat on &lt;code&gt;diet[1].food&lt;/code&gt; and likes=true on &lt;code&gt;diet[0].likes&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To retrieve only the points where both conditions hold for the same array element (here, only the point with id 1), you need to use a nested object filter.&lt;/p&gt;

&lt;p&gt;Nested object filters enable querying arrays of objects independently, ensuring conditions are checked within individual array elements.&lt;/p&gt;

&lt;p&gt;This is done by using the &lt;code&gt;nested&lt;/code&gt; condition type, which consists of a payload key that targets an array and a filter to apply. The key should reference an array of objects and can be written with or without bracket notation (e.g., "data" or "data[]").&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/dinosaurs/points/scroll
{
    "filter": {
        "must": [{
            "nested": {
                "key": "diet",
                "filter":{
                    "must": [
                        {
                            "key": "food",
                            "match": {
                                "value": "meat"
                            }
                        },
                        {
                            "key": "likes",
                            "match": {
                                "value": true
                            }
                        }
                    ]
                }
            }
        }]
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dinosaurs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scroll_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NestedCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Nested&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dinosaurs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;nested&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
              &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;food&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;meat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;likes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NestedCondition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="nf"&gt;.scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;NestedCondition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"diet"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;])),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;()])),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.List&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.Filter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrollAsync&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCollectionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addMust&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                            &lt;span class="s"&gt;"diet"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                            &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAllMust&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                                    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                                        &lt;span class="n"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"food"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"likes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Qdrant.Client&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conditions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;6334&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ScrollAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Nested&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MatchKeyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The matching logic is adjusted to operate at the level of individual elements within an array in the payload.&lt;/p&gt;

&lt;p&gt;Nested filters work as though each element of the array were evaluated separately. The parent document is considered a match if at least one array element satisfies all of the nested filter conditions.&lt;/p&gt;
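
&lt;p&gt;For instance, a point with the following payload (an illustrative example reusing the field names from the snippets above) matches the filter, because its first &lt;code&gt;diet&lt;/code&gt; element satisfies both conditions at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "diet": [
    { "food": "meat", "likes": true },
    { "food": "plants", "likes": false }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A point whose array contained only &lt;code&gt;{ "food": "meat", "likes": false }&lt;/code&gt; and &lt;code&gt;{ "food": "plants", "likes": true }&lt;/code&gt; would not match: no single element satisfies both conditions together.&lt;/p&gt;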

&lt;h2&gt;
  
  
  Other creative uses for filters
&lt;/h2&gt;

&lt;p&gt;You can use filters to retrieve data points without knowing their &lt;code&gt;id&lt;/code&gt;. You can search through data and manage it solely by using filters. Let's take a look at some creative uses for filters (a Python sketch of two of them follows the table):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#delete-points"&gt;Delete Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Deletes all points matching the filter.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/payload/#set-payload"&gt;Set Payload&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Adds payload fields to all points matching the filter.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#scroll-points"&gt;Scroll Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Lists all points matching the filter.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/payload/#overwrite-payload"&gt;Update Payload&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Updates payload fields for points matching the filter.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#order-points-by-payload-key"&gt;Order Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Lists points matching the filter, sorted by a payload key.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/payload/#delete-payload-keys"&gt;Delete Payload&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Deletes fields for points matching the filter.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#counting-points"&gt;Count Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Totals the points matching the filter.&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
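
&lt;p&gt;As a quick sketch of two of these operations with the Python client (the collection and field names are illustrative), counting and deleting by filter could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# A reusable filter: every point whose "category" is "laptop"
laptop_filter = models.Filter(
    must=[models.FieldCondition(key="category", match=models.MatchValue(value="laptop"))]
)

# Count the matching points without retrieving them
total = client.count(collection_name="computers", count_filter=laptop_filter)
print(total.count)

# Delete every matching point - no ids required
client.delete(
    collection_name="computers",
    points_selector=models.FilterSelector(filter=laptop_filter),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;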

&lt;h2&gt;
  
  
  Filtering with the payload index
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnetjc13aqk0wuoqxh06e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnetjc13aqk0wuoqxh06e.png" alt="vector-search-filtering-vector-search" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you start working with Qdrant, your data is organized in a vector index by default. &lt;br&gt;
In addition to this, we recommend adding a secondary data structure - &lt;strong&gt;the payload index&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Just as the vector index organizes vectors, the payload index structures your metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; The payload index is an additional data structure that supports vector search. A payload index (in green) organizes candidate results by cardinality, so that semantic search (in red) can traverse the vector index quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur2cj4wfc4ja57jufln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur2cj4wfc4ja57jufln.png" alt="payload-index-vector-search" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On its own, semantic searching over terabytes of data can take up lots of RAM. &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://dev.to/documentation/concepts/indexing/"&gt;&lt;strong&gt;Indexing&lt;/strong&gt;&lt;/a&gt; are two easy strategies to reduce your compute usage and still get the best results. Remember, this is only a guide. For an exhaustive list of filtering options, you should read the &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;filtering documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Here is how you can create a single index for a metadata field "category":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/computers/index
{
    "field_name": "category",
    "field_schema": "keyword"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_payload_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;field_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;field_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you mark a field as indexable, &lt;strong&gt;you don't need to do anything else&lt;/strong&gt;. Qdrant will handle all optimizations in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why should you index metadata?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhez94br0kvkjiz02m74u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhez94br0kvkjiz02m74u.png" alt="payload-index-filtering" width="800" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The payload index acts as a secondary data structure that speeds up retrieval. Whenever you run vector search with a filter, Qdrant will consult a payload index - if there is one. &lt;/p&gt;

&lt;p&gt;If you are indexing your metadata, the difference in search performance can be dramatic. &lt;/p&gt;

&lt;p&gt;As your dataset grows, Qdrant needs additional resources to go through all data points. Without a proper data structure, searches take longer - or run out of resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Payload indexing helps evaluate the most restrictive filters
&lt;/h3&gt;

&lt;p&gt;The payload index is also used to accurately estimate &lt;strong&gt;filter cardinality&lt;/strong&gt;, which helps the query planner choose a search strategy. &lt;strong&gt;Filter cardinality&lt;/strong&gt; refers to the number of points that match a filter within a dataset. Qdrant's search strategy can switch from &lt;strong&gt;HNSW search&lt;/strong&gt; to &lt;strong&gt;payload index-based search&lt;/strong&gt; if the cardinality is too low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it affects your queries:&lt;/strong&gt; Depending on the filter used in the search, there are several possible scenarios for query execution. Qdrant chooses among them based on the available indexes, the complexity of the conditions, and the cardinality of the filtered result. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The planner estimates the cardinality of the filtered result before selecting a strategy.&lt;/li&gt;
&lt;li&gt;Qdrant retrieves points using the &lt;strong&gt;payload index&lt;/strong&gt; if the cardinality is below a threshold.&lt;/li&gt;
&lt;li&gt;Qdrant uses the &lt;strong&gt;filterable vector index&lt;/strong&gt; if the cardinality is above the threshold.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What happens if you don't use payload indexes?
&lt;/h3&gt;

&lt;p&gt;If you only rely on &lt;strong&gt;searching for the nearest vector&lt;/strong&gt;, Qdrant will have to go through the entire vector index. It will calculate similarities against each vector in the collection, relevant or not. Alternatively, when you filter with the help of a payload index, the HNSW algorithm won't have to evaluate every point. Furthermore, the payload index helps HNSW construct its graph with additional links.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does the payload index look?
&lt;/h2&gt;

&lt;p&gt;A payload index is similar to an index in a conventional document-oriented database. It connects metadata fields with their corresponding point IDs for quick retrieval. &lt;/p&gt;

&lt;p&gt;In this example, you are indexing all of your computer hardware inside of the &lt;code&gt;computers&lt;/code&gt; collection. Let’s take a look at a sample payload index for the field &lt;code&gt;category&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Index&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyword:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;+------------+-------------+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;category&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;+------------+-------------+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;laptop&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;desktop&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;speakers&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyboard&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;+------------+-------------+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When fields are properly indexed, the search engine knows roughly where to start. It can look up the points that contain relevant metadata without scanning the entire dataset, which greatly reduces its workload. As a result, queries return faster and the system scales easily.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You may create as many payload indexes as you want, and we recommend you do so for each field that is frequently used. &lt;br&gt;
If your users are often filtering by &lt;strong&gt;laptop&lt;/strong&gt; when looking up a product &lt;strong&gt;category&lt;/strong&gt;, indexing all computer metadata will speed up retrieval and make the results more precise.&lt;/p&gt;
&lt;/blockquote&gt;
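
&lt;p&gt;With the &lt;code&gt;category&lt;/code&gt; field indexed, such a filtered search over the &lt;code&gt;computers&lt;/code&gt; collection could look like this (a minimal sketch; the query vector is a placeholder for a real embedding):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

hits = client.query_points(
    collection_name="computers",
    query=[0.2, 0.8, 0.5, 0.1],  # placeholder for a real query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="category", match=models.MatchValue(value="laptop"))]
    ),
    limit=5,
).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;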

&lt;h3&gt;
  
  
  Different types of payload indexes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#full-text-index"&gt;Full-text Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Enables efficient text search in large datasets.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#tenant-index"&gt;Tenant Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For data isolation and retrieval efficiency in multi-tenant architectures.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#principal-index"&gt;Principal Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Manages data based on primary entities like users or accounts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#on-disk-payload-index"&gt;On-Disk Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Stores indexes on disk to support large datasets without keeping the index in RAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#parameterized-index"&gt;Parameterized Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Allows for dynamic querying, where the index can adapt based on different parameters or conditions provided by the user. Useful for numeric data like prices or timestamps.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Indexing payloads in multitenant setups
&lt;/h2&gt;

&lt;p&gt;Some applications need their data segregated: different users must see different data inside the same application. When designing storage for such a system, many assume they need a separate database per group of users.&lt;/p&gt;

&lt;p&gt;We see this quite often: users create a separate collection for each tenant inside the same cluster, which can quickly exhaust the cluster’s resources. Running vector search across too many collections consumes excessive RAM, and you may start seeing out-of-memory (OOM) errors and degraded performance. &lt;/p&gt;

&lt;p&gt;To mitigate this, we offer extensive support for multitenant systems, so that you can build an entire global application in one single Qdrant collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}/index
{
   "field_name": "payload_field_name",
   "field_schema": {
       "type": "keyword",
       "is_tenant": true
   }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tenant index is another variant of the payload index. When creating or updating a collection, you mark a metadata field as indexable - but this time the request specifies the field as a tenant. You can mark fields such as user types or customer IDs with &lt;code&gt;is_tenant: true&lt;/code&gt;. &lt;/p&gt;
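
&lt;p&gt;Here is how the same tenant index could be created and used with the Python client - a sketch in which &lt;code&gt;my_app&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt; are an illustrative collection and tenant field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Mark the field as the tenant key when indexing it
client.create_payload_index(
    collection_name="my_app",
    field_name="user_id",
    field_schema=models.KeywordIndexParams(
        type=models.KeywordIndexType.KEYWORD,
        is_tenant=True,
    ),
)

# Every query then filters by the tenant, so each user only sees their own data
client.query_points(
    collection_name="my_app",
    query=[0.2, 0.1, 0.9, 0.7],  # placeholder for a real query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="user_id", match=models.MatchValue(value="user_42"))]
    ),
    limit=10,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;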

&lt;h2&gt;
  
  
  Key takeaways in filtering and indexing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvghqy7yhnib2rs2znau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvghqy7yhnib2rs2znau.png" alt="best-practices" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
Filtering with floating-point (decimal) numbers
&lt;/h2&gt;

&lt;p&gt;If you filter on the float data type with exact matches, your results can be imprecise and unreliable. &lt;/p&gt;

&lt;p&gt;Numbers of the float datatype have a decimal point and are 64 bits in size. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.99&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you filter for a specific float value, such as 11.99, you may get a different result, like 11.98 or 12.00. Decimal values are rounded and stored in binary floating point, so logically identical values can end up with slightly different representations. Searching for exact matches is therefore unreliable.&lt;/p&gt;

&lt;p&gt;To avoid these inaccuracies, use a different filtering method. We recommend range-based filtering instead of exact matches. Ranges tolerate minor variations in the data, and they boost performance - especially with large datasets. &lt;/p&gt;

&lt;p&gt;Here is a sample JSON range filter with both bounds set to 11.99 (greater than or equal to, and less than or equal to, the same number). It retrieves any value that falls within this range, including floating-point representations with additional decimal places.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"gt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"gte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"lte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.99&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
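
&lt;p&gt;The same range condition, expressed with the Python client, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

price_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="price",
            # Both bounds set to 11.99, mirroring the JSON filter above
            range=models.Range(gte=11.99, lte=11.99),
        )
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;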



&lt;h2&gt;
  
  
  Working with pagination in queries
&lt;/h2&gt;

&lt;p&gt;When you're implementing pagination in filtered queries, indexing becomes even more critical. When paginating results, you often need to exclude items you've already seen. This is typically managed by applying filters that specify which IDs should not be included in the next set of results. &lt;/p&gt;

&lt;p&gt;However, an interesting aspect of Qdrant's data model is that a single point can have multiple values for the same field, such as different color options for a product. This means that during filtering, an ID might appear multiple times if it matches on different values of the same field. &lt;/p&gt;

&lt;p&gt;Proper indexing ensures that these queries are efficient, preventing duplicate results and making pagination smoother.&lt;/p&gt;
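
&lt;p&gt;A common pattern is cursor-based pagination with the scroll API: Qdrant returns the offset of the next page alongside each batch, so you never re-fetch points you have already seen. A minimal sketch with illustrative names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

next_offset = None
while True:
    points, next_offset = client.scroll(
        collection_name="computers",
        scroll_filter=models.Filter(
            must=[models.FieldCondition(key="category", match=models.MatchValue(value="laptop"))]
        ),
        limit=100,
        offset=next_offset,  # continue exactly where the previous page ended
    )
    for point in points:
        ...  # process the current page
    if next_offset is None:
        break  # no more pages left
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;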

&lt;h2&gt;
  
  
  Conclusion: Real-life use cases of filtering
&lt;/h2&gt;

&lt;p&gt;Filtering in a &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;vector database&lt;/a&gt; like Qdrant can significantly enhance search capabilities by enabling more precise and efficient retrieval of data. &lt;/p&gt;

&lt;p&gt;As a conclusion to this guide, let's look at some real-life use cases where filtering is crucial:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Vector Search&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/advanced-search/"&gt;E-Commerce Product Search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Search for products by style or visual similarity&lt;/td&gt;
&lt;td&gt;Filter by price, color, brand, size, ratings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/recommendations/"&gt;Recommendation Systems&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Recommend similar content (e.g., movies, songs)&lt;/td&gt;
&lt;td&gt;Filter by release date, genre, etc. (e.g., movies after 2020)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/articles/geo-polygon-filter-gsoc/"&gt;Geospatial Search in Ride-Sharing&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Find similar drivers or delivery partners&lt;/td&gt;
&lt;td&gt;Filter by rating, distance radius, vehicle type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/data-analysis-anomaly-detection/"&gt;Fraud &amp;amp; Anomaly Detection&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Detect transactions similar to known fraud cases&lt;/td&gt;
&lt;td&gt;Filter by amount, time, location&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Before you go - all the code is in Qdrant's Dashboard
&lt;/h3&gt;

&lt;p&gt;The easiest way to reach that "Hello World" moment is to &lt;a href="https://dev.to/documentation/quickstart-cloud/"&gt;&lt;strong&gt;try filtering in a live cluster&lt;/strong&gt;&lt;/a&gt;. Our interactive tutorial will show you how to create a cluster, add data and try some filtering clauses. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance!</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Thu, 29 Aug 2024 15:59:51 +0000</pubDate>
      <link>https://dev.to/qdrant/any-embedding-model-can-become-a-late-interaction-model-if-you-give-it-a-chance-3iip</link>
      <guid>https://dev.to/qdrant/any-embedding-model-can-become-a-late-interaction-model-if-you-give-it-a-chance-3iip</guid>
      <description>&lt;p&gt;* At least any open-source model, since you need access to its internals.&lt;/p&gt;

&lt;h3&gt;
  
  
  You Can Adapt Dense Embedding Models for Late Interaction
&lt;/h3&gt;

&lt;p&gt;Qdrant 1.10 introduced support for multi-vector representations, with late interaction being a prominent example of this approach. In essence, both documents and queries are represented by multiple vectors, and identifying the most relevant documents involves calculating a score based on the similarity between corresponding query and document embeddings. If you're not familiar with this paradigm, our updated &lt;a href="https://dev.to/articles/hybrid-search/"&gt;Hybrid Search&lt;/a&gt; article explains how multi-vector representations can enhance retrieval quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; We can visualize late interaction between corresponding document-query embedding pairs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3zfr1tmzptr7ra0sjaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3zfr1tmzptr7ra0sjaz.png" alt="Late interaction" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many specialized late interaction models, such as ColBERT, but &lt;strong&gt;it appears that regular dense embedding models can also be effectively utilized in this manner&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this study, we will demonstrate that standard dense embedding models, traditionally used for single-vector representations, can be effectively adapted for late interaction scenarios using output token embeddings as multi-vector representations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By testing out retrieval with Qdrant’s multi-vector feature, we will show that these models can rival or surpass specialized late interaction models in retrieval performance, while offering lower complexity and greater efficiency. This work redefines the potential of dense models in advanced search pipelines, presenting a new method for optimizing retrieval systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Embedding Models
&lt;/h2&gt;

&lt;p&gt;The inner workings of embedding models might be surprising to some. The model doesn’t operate directly on the input text; instead, it requires a tokenization step to convert the text into a sequence of token identifiers. Each token identifier is then passed through an embedding layer, which transforms it into a dense vector. Essentially, the embedding layer acts as a lookup table that maps token identifiers to dense vectors. These vectors are then fed into the transformer model as input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; The tokenization step, which takes place before vectors are added to the transformer model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81uxh0sz0fagvusxibqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81uxh0sz0fagvusxibqu.png" alt="Input token embeddings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The input token embeddings are context-free and are learned during the model’s training process. This means that each token always receives the same embedding, regardless of its position in the text. At this stage, the token embeddings are unaware of the context in which they appear. It is the transformer model’s role to contextualize these embeddings.&lt;/p&gt;

&lt;p&gt;Much has been discussed about the role of attention in transformer models, but in essence, this mechanism is responsible for capturing cross-token relationships. Each transformer module takes a sequence of token embeddings as input and produces a sequence of output token embeddings. Both sequences are of the same length, with each token embedding being enriched by information from the other token embeddings at the current step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; The mechanism which produces a sequence of output token embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg5yjc55utu5fn30n0zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg5yjc55utu5fn30n0zx.png" alt="Output token embeddings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; The final step performed by the embedding model is pooling the output token embeddings to generate a single vector representation of the input text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feop04kiezatxax2jsk3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feop04kiezatxax2jsk3f.png" alt="Pooling" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are several pooling strategies, but regardless of which one a model uses, the output is always a single vector representation, which inevitably loses some information about the input. It’s akin to giving someone detailed, step-by-step directions to the nearest grocery store versus simply pointing in the general direction. While the vague direction might suffice in some cases, the detailed instructions are more likely to lead to the desired outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Output Token Embeddings for Multi-Vector Representations
&lt;/h3&gt;

&lt;p&gt;We often overlook the output token embeddings, but the fact is—they also serve as multi-vector representations of the input text. So, why not explore their use in a multi-vector retrieval model, similar to late interaction models?&lt;/p&gt;
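
&lt;p&gt;With sentence-transformers, both representations are available from the same model; a minimal sketch (shapes shown for &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pooled representation: a single 384-dimensional vector per text
pooled = model.encode("A sample sentence")
print(pooled.shape)  # (384,)

# Output token embeddings: one 384-dimensional vector per token
tokens = model.encode("A sample sentence", output_value="token_embeddings")
print(tokens.shape)  # (number_of_tokens, 384)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;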

&lt;h4&gt;
  
  
  Experimental Findings
&lt;/h4&gt;

&lt;p&gt;We conducted several experiments to determine whether output token embeddings could be effectively used in place of traditional late interaction models. The results are quite promising.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Experiment&lt;/th&gt;
            &lt;th&gt;NDCG@10&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;SciFact&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.70928&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.69579&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.64508&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.70724&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.68213&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.73696&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;NFCorpus&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.34166&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.35036&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.31594&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.35779&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.29696&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.37502&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;ArguAna&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.47271&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.44534&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.50167&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.45997&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.58857&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.57648&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_all_exact.py" rel="noopener noreferrer"&gt;source code for these experiments is open-source&lt;/a&gt; and utilizes &lt;a href="https://github.com/kacperlukawski/beir-qdrant" rel="noopener noreferrer"&gt;&lt;code&gt;beir-qdrant&lt;/code&gt;&lt;/a&gt;, an integration of Qdrant with the &lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BeIR library&lt;/a&gt;. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.&lt;/p&gt;

&lt;p&gt;Even the simple &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model can be applied in a late interaction model fashion, resulting in a positive impact on retrieval quality. However, the best results were achieved with the &lt;code&gt;BAAI/bge-small-en&lt;/code&gt; model, which outperformed both sparse and late interaction models.&lt;/p&gt;

&lt;p&gt;It's important to note that ColBERT has not been trained on BeIR datasets, making its performance fully out-of-domain. Nevertheless, the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data" rel="noopener noreferrer"&gt;training dataset&lt;/a&gt; also lacks any BeIR data, yet it still performs remarkably well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparative Analysis of Dense vs. Late Interaction Models
&lt;/h3&gt;

&lt;p&gt;The retrieval quality speaks for itself, but there are other important factors to consider.&lt;/p&gt;

&lt;p&gt;The traditional dense embedding models we tested are less complex than late interaction or sparse models. With fewer parameters, these models are expected to be faster during inference and more cost-effective to maintain. Below is a comparison of the models used in the experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Number of parameters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;109,514,298&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;109,580,544&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;33,360,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;22,713,216&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One argument against using output token embeddings is the increased storage requirements compared to ColBERT-like models. For instance, the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model produces 384-dimensional output token embeddings, which is three times more than the 128-dimensional embeddings generated by ColBERT-like models. This increase not only leads to higher memory usage but also impacts the computational cost of retrieval, as calculating distances takes more time. Mitigating this issue through vector compression would make a lot of sense.&lt;/p&gt;
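
&lt;p&gt;To put rough numbers on the overhead (assuming uncompressed float32 vectors, i.e. 4 bytes per dimension):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-token storage for uncompressed float32 embeddings
minilm_bytes = 384 * 4   # 1536 bytes per all-MiniLM-L6-v2 token embedding
colbert_bytes = 128 * 4  # 512 bytes per ColBERT token embedding
print(minilm_bytes / colbert_bytes)  # 3.0 - three times the memory per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;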

&lt;h4&gt;
  
  
  Exploring Quantization for Multi-Vector Representations
&lt;/h4&gt;

&lt;p&gt;Binary quantization is generally more effective for high-dimensional vectors, making the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model, with its relatively low-dimensional outputs, less ideal for this approach. However, scalar quantization appeared to be a viable alternative. The table below summarizes the impact of quantization on retrieval quality.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Experiment&lt;/th&gt;
            &lt;th&gt;NDCG@10&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;SciFact&lt;/th&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.70724&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings (uint8)&lt;/td&gt;
            &lt;td&gt;0.70297&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;NFCorpus&lt;/th&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.35779&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings (uint8)&lt;/td&gt;
            &lt;td&gt;0.35572&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s important to note that quantization doesn’t always preserve retrieval quality at the same level, but in this case, scalar quantization appears to have minimal impact on retrieval performance. The effect is negligible, while the memory savings are substantial.&lt;/p&gt;

&lt;p&gt;We managed to maintain the original quality while using a quarter of the memory. Additionally, a quantized &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; token embedding requires 384 bytes (384 dimensions at one byte each), compared to ColBERT’s 512 bytes (128 float32 dimensions at four bytes each). This results in a 25% reduction in memory usage, with retrieval quality remaining nearly unchanged.&lt;/p&gt;
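
&lt;p&gt;Enabling scalar quantization in Qdrant takes a single extra parameter in the collection configuration. A minimal sketch, mirroring the multi-vector setup used in these experiments (the collection name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="quantized-tokens",
    vectors_config=models.VectorParams(
        size=384,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
    # Store each dimension as a single byte instead of a 4-byte float
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;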

&lt;h3&gt;
  
  
  Practical Application: Enhancing Retrieval with Dense Models
&lt;/h3&gt;

&lt;p&gt;If you’re using one of the sentence transformer models, the output token embeddings are calculated by default. While a single vector representation is more efficient in terms of storage and computation, there’s no need to discard the output token embeddings. According to our experiments, these embeddings can significantly enhance retrieval quality. You can store both the single vector and the output token embeddings in Qdrant, using the single vector for the initial retrieval step and then reranking the results with the output token embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; A single model pipeline that relies solely on the output token embeddings for reranking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosc6fs0fez7z7uvt6m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosc6fs0fez7z7uvt6m9.png" alt="Single model reranking" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we implemented a simple reranking pipeline in Qdrant. This pipeline uses a dense embedding model for the initial oversampled retrieval and then relies solely on the output token embeddings for the reranking step.&lt;/p&gt;

&lt;h4&gt;
  
  
  Single Model Retrieval and Reranking Benchmarks
&lt;/h4&gt;

&lt;p&gt;Our tests focused on using the same model for both retrieval and reranking. The reported metric is NDCG@10. In all tests, we applied an oversampling factor of 5x, meaning the retrieval step returned 50 results, which were then narrowed down to 10 during the reranking step. Below are the results for some of the BeIR datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;Dataset&lt;/th&gt;
            &lt;th colspan="2"&gt;&lt;code&gt;all-miniLM-L6-v2&lt;/code&gt;&lt;/th&gt;
            &lt;th colspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;dense embeddings only&lt;/th&gt;
            &lt;th&gt;dense + reranking&lt;/th&gt;
            &lt;th&gt;dense embeddings only&lt;/th&gt;
            &lt;th&gt;dense + reranking&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th&gt;SciFact&lt;/th&gt;
            &lt;td&gt;0.64508&lt;/td&gt;
            &lt;td&gt;0.70293&lt;/td&gt;
            &lt;td&gt;0.68213&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.73053&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;NFCorpus&lt;/th&gt;
            &lt;td&gt;0.31594&lt;/td&gt;
            &lt;td&gt;0.34297&lt;/td&gt;
            &lt;td&gt;0.29696&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.35996&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;ArguAna&lt;/th&gt;
            &lt;td&gt;0.50167&lt;/td&gt;
            &lt;td&gt;0.45378&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.58857&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.57302&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;Touche-2020&lt;/th&gt;
            &lt;td&gt;0.16904&lt;/td&gt;
            &lt;td&gt;0.19693&lt;/td&gt;
            &lt;td&gt;0.13055&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.19821&lt;/u&gt;&lt;/td&gt;        
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;TREC-COVID&lt;/th&gt;
            &lt;td&gt;0.47246&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.6379&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.45788&lt;/td&gt;
            &lt;td&gt;0.53539&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;FiQA-2018&lt;/th&gt;
            &lt;td&gt;0.36867&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.41587&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.31091&lt;/td&gt;
            &lt;td&gt;0.39067&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The source code for the benchmark is publicly available, and &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_reranking.py" rel="noopener noreferrer"&gt;you can find it in the repository of the &lt;code&gt;beir-qdrant&lt;/code&gt; package&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Overall, adding a reranking step using the same model typically improves retrieval quality. However, the quality of various late interaction models is &lt;a href="https://huggingface.co/mixedbread-ai/mxbai-colbert-large-v1#1-reranking-performance" rel="noopener noreferrer"&gt;often reported based on their reranking performance when BM25 is used for the initial retrieval&lt;/a&gt;. This experiment aimed to demonstrate how a single model can be effectively used for both retrieval and reranking, and the results are quite promising.&lt;/p&gt;

&lt;p&gt;Now, let's explore how to implement this using the new Query API introduced in Qdrant 1.10.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guide: Setting Up Qdrant for Late Interaction
&lt;/h3&gt;

&lt;p&gt;The new Query API in Qdrant 1.10 enables the construction of even more complex retrieval pipelines. We can use the single vector created after pooling for the initial retrieval step and then rerank the results using the output token embeddings.&lt;/p&gt;

&lt;p&gt;Assuming the collection is named &lt;code&gt;my-collection&lt;/code&gt; and is configured to store two named vectors: &lt;code&gt;dense-vector&lt;/code&gt; and &lt;code&gt;output-token-embeddings&lt;/code&gt;, here’s how such a collection could be created in Qdrant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-token-embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;multivector_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiVectorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;comparator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MultiVectorComparator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_SIM&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both vectors are of the same size since they are produced by the same &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
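
&lt;p&gt;Ingesting a document then means computing both representations with that one model and storing them under the two named vectors. A sketch, reusing the &lt;code&gt;client&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; created above (the document text and id are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

document = "Qdrant 1.10 introduced support for multi-vector representations."

client.upsert(
    collection_name="my-collection",
    points=[
        models.PointStruct(
            id=1,
            vector={
                # Pooled vector for the initial retrieval step
                "dense-vector": model.encode(document).tolist(),
                # Token embeddings (a list of vectors) for reranking
                "output-token-embeddings": model.encode(
                    document, output_value="token_embeddings"
                ).tolist(),
            },
            payload={"text": document},
        )
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;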



&lt;p&gt;Now, instead of using the search API with just a single dense vector, we can create a reranking pipeline. First, we retrieve 50 results using the dense vector, and then we rerank them using the output token embeddings to obtain the top 10 results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What else can be done with just all-MiniLM-L6-v2 model?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="c1"&gt;# Prefetch the dense embeddings of the top-50 documents
&lt;/span&gt;        &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Rerank the top-50 documents retrieved by the dense embedding model
&lt;/span&gt;    &lt;span class="c1"&gt;# and return just the top-10. Please note we call the same model, but
&lt;/span&gt;    &lt;span class="c1"&gt;# we ask for the token embeddings by setting the output_value parameter.
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-token-embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Try the Experiment Yourself
&lt;/h3&gt;

&lt;p&gt;In a real-world scenario, you might take it a step further by first calculating the token embeddings and then performing pooling to obtain the single vector representation. This approach allows you to complete everything in a single pass.&lt;/p&gt;
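
&lt;p&gt;Here is a minimal sketch of that single-pass idea. It assumes the model’s default pooling setup: &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; builds its sentence vector by mean-pooling the token embeddings and L2-normalizing the result, so the dense vector can be derived from the same forward pass instead of encoding twice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text = "What else can be done with just all-MiniLM-L6-v2 model?"

# Single forward pass: request only the token embeddings.
token_embeddings = model.encode(text, output_value="token_embeddings")
token_embeddings = token_embeddings.cpu().numpy()  # shape: (num_tokens, 384)

# Mean-pool and L2-normalize to approximate the model's own sentence vector.
dense_vector = token_embeddings.mean(axis=0)
dense_vector = dense_vector / np.linalg.norm(dense_vector)

# Both `dense_vector` and `token_embeddings` can now be upserted into the
# two named vectors of the collection defined above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;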

&lt;p&gt;The simplest way to start experimenting with building complex reranking pipelines in Qdrant is by using the forever-free cluster on &lt;a href="https://cloud.qdrant.io/" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt; and reading &lt;a href="https://qdrant.tech/documentation/"&gt;Qdrant's documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_all_exact.py" rel="noopener noreferrer"&gt;source code for these experiments is open-source&lt;/a&gt; and uses &lt;a href="https://github.com/kacperlukawski/beir-qdrant" rel="noopener noreferrer"&gt;&lt;code&gt;beir-qdrant&lt;/code&gt;&lt;/a&gt;, an integration of Qdrant with the &lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BeIR library&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Directions and Research Opportunities
&lt;/h3&gt;

&lt;p&gt;The initial experiments using output token embeddings in the retrieval process have yielded promising results. However, we plan to conduct further benchmarks to validate these findings and explore the incorporation of sparse methods for the initial retrieval. Additionally, we aim to investigate the impact of quantization on multi-vector representations and its effects on retrieval quality. Finally, we will assess retrieval speed, a crucial factor for many applications.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Optimizing RAG Through an Evaluation-Based Methodology</title>
      <dc:creator>Atita Arora</dc:creator>
      <pubDate>Wed, 03 Jul 2024 13:18:22 +0000</pubDate>
      <link>https://dev.to/qdrant/optimizing-rag-through-an-evaluation-based-methodology-45pj</link>
      <guid>https://dev.to/qdrant/optimizing-rag-through-an-evaluation-based-methodology-45pj</guid>
      <description>&lt;p&gt;In today's fast-paced, information-rich world, AI is revolutionizing knowledge management. The systematic process of capturing, distributing, and effectively using knowledge within an organization is one of the fields in which AI provides exceptional value today. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The potential for AI-powered knowledge management increases when leveraging Retrieval Augmented Generation (RAG), a methodology that enables LLMs to access a vast, diverse repository of factual information from knowledge stores, such as vector databases. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This process enhances the accuracy, relevance, and reliability of generated text, thereby mitigating the risk of faulty, incorrect, or nonsensical results sometimes associated with traditional LLMs. This method not only ensures that the answers are contextually relevant but also up-to-date, reflecting the latest insights and data available.&lt;/p&gt;

&lt;p&gt;While RAG enhances the accuracy, relevance, and reliability of traditional LLM solutions, &lt;strong&gt;an evaluation strategy can further help teams ensure their AI products meet these benchmarks of success.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Relevant tools for this experiment
&lt;/h2&gt;

&lt;p&gt;In this article, we’ll break down a RAG optimization workflow experiment that demonstrates that evaluation is essential to building a successful RAG strategy. We will use Qdrant and Quotient for this experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/"&gt;Qdrant&lt;/a&gt; is a vector database and vector similarity search engine designed for efficient storage and retrieval of high-dimensional vectors. Because Qdrant offers efficient indexing and searching capabilities, it is ideal for implementing RAG solutions, where quickly and accurately retrieving relevant information from extremely large datasets is crucial. Qdrant also offers a wealth of additional features, such as quantization, multivector support and multi-tenancy.&lt;/p&gt;

&lt;p&gt;Alongside Qdrant, we will use Quotient, which provides a seamless way to evaluate your RAG implementation, accelerating and improving the experimentation process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.quotientai.co/"&gt;Quotient&lt;/a&gt; is a platform that provides tooling for AI developers to build evaluation frameworks and conduct experiments on their products. Evaluation is how teams surface the shortcomings of their applications and improve performance in key benchmarks such as faithfulness, and semantic similarity. Iteration is key to building innovative AI products that will deliver value to end users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The &lt;a href="https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-quotient"&gt;accompanying notebook&lt;/a&gt; for this exercise can be found on GitHub for future reference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Summary of key findings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevance and Hallucinations&lt;/strong&gt;: When the documents retrieved are irrelevant, evidenced by low scores in both Chunk Relevance and Context Relevance, the model is prone to generating inaccurate or fabricated information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing Document Retrieval&lt;/strong&gt;: By retrieving a greater number of documents and reducing the chunk size, we observed improved outcomes in the model's performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Retrieval Needs&lt;/strong&gt;: Certain queries may benefit from accessing more documents. Implementing a dynamic retrieval strategy that adjusts based on the query could enhance accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Influence of Model and Prompt Variations&lt;/strong&gt;: Alterations in language models or the prompts used can significantly impact the quality of the generated responses, suggesting that fine-tuning these elements could optimize performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let us walk you through how we arrived at these findings!&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a RAG pipeline
&lt;/h2&gt;

&lt;p&gt;To evaluate a RAG pipeline, we first have to build one. In the interest of simplicity, we are building a naive RAG in this article. There are certainly other variants of RAG:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ulnki54g5m4xmism86x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ulnki54g5m4xmism86x.png" alt="shades_of_rag.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The illustration below depicts how we can leverage a RAG evaluation framework to assess the quality of a RAG application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpsymcileyqa6fxwajdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpsymcileyqa6fxwajdy.png" alt="qdrant_and_quotient.png" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are going to build a RAG application using Qdrant’s documentation and the prebuilt &lt;a href="https://huggingface.co/datasets/atitaarora/qdrant_doc"&gt;Hugging Face dataset&lt;/a&gt;.&lt;br&gt;
We will then assess our RAG application’s ability to answer questions about Qdrant.&lt;/p&gt;

&lt;p&gt;To prepare our knowledge store we will use Qdrant, which can be connected to in three different ways, as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;##Uncomment to initialise qdrant client in memory
#client = qdrant_client.QdrantClient(
#    location=":memory:",
#)
&lt;/span&gt;
&lt;span class="c1"&gt;##Uncomment below to connect to Qdrant Cloud
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Uncomment below to connect to local Qdrant
#client = qdrant_client.QdrantClient("http://localhost:6333")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will be using &lt;a href="https://cloud.qdrant.io/login"&gt;Qdrant Cloud&lt;/a&gt; so it is a good idea to provide the &lt;code&gt;QDRANT_URL&lt;/code&gt; and &lt;code&gt;QDRANT_API_KEY&lt;/code&gt; as environment variables for easier access.&lt;/p&gt;
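
&lt;p&gt;For example (the values below are placeholders, not real credentials):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Placeholders - replace with your own Qdrant Cloud cluster URL and API key.
os.environ["QDRANT_URL"] = "https://YOUR-CLUSTER-ID.cloud.qdrant.io"
os.environ["QDRANT_API_KEY"] = "YOUR_API_KEY"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;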

&lt;p&gt;Moving on, we will need to define the collection name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant-docs-quotient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, we may need to create different collections based on the experiments we conduct.&lt;/p&gt;

&lt;p&gt;To create embeddings seamlessly throughout the experiment, we will use Qdrant’s native embedding provider &lt;a href="https://qdrant.github.io/fastembed/"&gt;Fastembed&lt;/a&gt;, which supports &lt;a href="https://qdrant.github.io/fastembed/examples/Supported_Models/"&gt;many different models&lt;/a&gt;, including both dense and sparse vector models.&lt;/p&gt;

&lt;p&gt;We can initialize and switch the embedding model of our choice as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Declaring the intended Embedding Model with Fastembed
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastembed.embedding&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextEmbedding&lt;/span&gt;

&lt;span class="c1"&gt;## General Fastembed specific operations
## Initialising the embedding model
## Using Default Model - BAAI/bge-small-en-v1.5
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextEmbedding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;## For custom model supported by Fastembed
#embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)
#embedding_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2", max_length=384)
&lt;/span&gt;
&lt;span class="c1"&gt;## Verify the chosen Embedding model
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
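
&lt;p&gt;As a quick sanity check (a small sketch, not part of the original notebook): &lt;code&gt;embed&lt;/code&gt; returns a generator of numpy vectors, and the default &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; model produces 384-dimensional embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Embed a sample text and inspect the vector dimensionality.
sample_vectors = list(embedding_model.embed(["What is Qdrant?"]))
print(len(sample_vectors[0]))

#Outputs
#384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;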



&lt;p&gt;Before implementing RAG, we need to prepare and index our data in Qdrant.&lt;/p&gt;

&lt;p&gt;This involves converting textual data into vectors using a suitable encoder (e.g., sentence transformers), and storing these vectors in Qdrant for retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.docstore.document&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;LangchainDocument&lt;/span&gt;

&lt;span class="c1"&gt;## Load the dataset with qdrant documentation
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;atitaarora/qdrant_doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Dataset to langchain document
&lt;/span&gt;&lt;span class="n"&gt;langchain_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;LangchainDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;langchain_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#240
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can preview documents in the dataset as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Here's an example of what a document in our dataset looks like
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluation dataset
&lt;/h2&gt;

&lt;p&gt;To measure the quality of our RAG setup, we will need a representative evaluation dataset. This dataset should contain realistic questions and the expected answers.&lt;/p&gt;

&lt;p&gt;Additionally, including the expected contexts for which your RAG pipeline is designed to retrieve information would be beneficial.&lt;/p&gt;

&lt;p&gt;We will be using a &lt;a href="https://huggingface.co/datasets/atitaarora/qdrant_doc_qna"&gt;prebuilt evaluation dataset&lt;/a&gt;. &lt;/p&gt;
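
&lt;p&gt;As a sketch of how to load it (the &lt;code&gt;input_text&lt;/code&gt; column name is an assumption taken from the &lt;code&gt;run_eval&lt;/code&gt; helper defined later, which reads the questions from that column):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datasets import load_dataset

# Load the prebuilt evaluation dataset (questions with ground-truth answers).
eval_dataset = load_dataset("atitaarora/qdrant_doc_qna", split="train")
eval_df = eval_dataset.to_pandas()

# run_eval (defined below) expects the questions in an 'input_text' column.
print(eval_df.columns.tolist())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;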

&lt;p&gt;If you are struggling to create an evaluation dataset for your use case, you can use your own documents and the techniques described in this &lt;a href="https://github.com/qdrant/qdrant-rag-eval/blob/master/synthetic_qna/notebook/Synthetic_question_generation.ipynb"&gt;notebook&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the RAG pipeline
&lt;/h3&gt;

&lt;p&gt;We establish the data preprocessing parameters essential for the RAG pipeline and configure the Qdrant vector database according to the specified criteria. &lt;/p&gt;

&lt;p&gt;Key parameters under consideration are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chunk size&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embedding model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of documents retrieved (retrieval window)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Following the ingestion of data into Qdrant, we proceed to retrieve the pertinent documents corresponding to each query. These documents are then added to our evaluation dataset, populating the designated &lt;strong&gt;&lt;code&gt;context&lt;/code&gt;&lt;/strong&gt; column required for evaluation.&lt;/p&gt;

&lt;p&gt;Next, we define two helper methods: one for adding documents to Qdrant&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function adds documents to the desired Qdrant collection given the specified RAG parameters.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;## Processing each document with desired TEXT_SPLITTER_ALGO, CHUNK_SIZE, CHUNK_OVERLAP
&lt;/span&gt;    &lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;add_start_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;docs_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;langchain_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;docs_processed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;## Processing documents to be encoded by Fastembed
&lt;/span&gt;    &lt;span class="n"&gt;docs_contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;docs_metadatas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs_processed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;docs_contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;docs_metadatas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Handle the case where attributes are missing
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warning: Some documents do not have &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; attributes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_processed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_contents&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_metadatas&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;## Adding documents to Qdrant using desired embedding model
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs_metadatas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs_contents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and one for retrieving documents from Qdrant during our RAG pipeline assessment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function retrieves the desired number of documents from the Qdrant collection given a query.
    It returns a list of the retrieved documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
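
&lt;p&gt;A quick illustrative call (the question string is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;docs = get_documents(COLLECTION_NAME, "How do I create a collection in Qdrant?", num_documents=3)

for doc in docs:
    print(doc[:80])  # preview the first 80 characters of each retrieved chunk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;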



&lt;h3&gt;
  
  
  Setting up Quotient
&lt;/h3&gt;

&lt;p&gt;You will need an account to log in, which you can get by requesting access on &lt;a href="https://www.quotientai.co/"&gt;Quotient's website&lt;/a&gt;. Once you have an account, you can create an API key by running the &lt;code&gt;quotient authenticate&lt;/code&gt; CLI command.&lt;/p&gt;

&lt;p&gt;💡 Be sure to save your API key, since it will only be displayed once (Note: you will not have to repeat this step again until your API key expires).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once you have your API key, make sure to set it as an environment variable called &lt;code&gt;QUOTIENT_API_KEY&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import QuotientAI client and connect to QuotientAI
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;quotientai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuotientClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;quotientai.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;show_job_progress&lt;/span&gt;

&lt;span class="c1"&gt;# IMPORTANT: be sure to set your API key as an environment variable called QUOTIENT_API_KEY
# You will need this set before running the code below. You may also uncomment the following line and insert your API key:
# os.environ['QUOTIENT_API_KEY'] = "YOUR_API_KEY"
&lt;/span&gt;
&lt;span class="n"&gt;quotient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QuotientClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;QuotientAI&lt;/strong&gt; provides a seamless way to integrate &lt;em&gt;RAG evaluation&lt;/em&gt; into your applications. Here, we'll see how to use it to evaluate text generated from an LLM, based on retrieved knowledge from the Qdrant vector database.&lt;/p&gt;

&lt;p&gt;After retrieving the top similar documents and populating the &lt;code&gt;context&lt;/code&gt; column, we can submit the evaluation dataset to Quotient and execute an evaluation job. To run a job, all you need is your evaluation dataset and a &lt;code&gt;recipe&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A recipe is a combination of a prompt template and a specified LLM.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quotient&lt;/strong&gt; orchestrates the evaluation run and handles version control and asset management throughout the experimentation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Prior to assessing our RAG solution, it's crucial to outline our optimization goals.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;In the context of &lt;em&gt;question-answering on Qdrant documentation&lt;/em&gt;, our focus extends beyond merely providing helpful responses. Ensuring the absence of any &lt;em&gt;inaccurate or misleading information&lt;/em&gt; is paramount. &lt;/p&gt;

&lt;p&gt;In other words, &lt;strong&gt;we want to minimize hallucinations&lt;/strong&gt; in the LLM outputs.&lt;/p&gt;

&lt;p&gt;For our evaluation, we will be considering the following metrics, with a focus on &lt;strong&gt;Faithfulness&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context Relevance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk Relevance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROUGE-L&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BERT Sentence Similarity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BERTScore&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluation in action
&lt;/h3&gt;

&lt;p&gt;The function below takes an evaluation dataset as input, which in this case contains questions and their corresponding answers. It retrieves relevant documents based on the questions in the dataset and populates the context field with this information from Qdrant. The prepared dataset is then submitted to QuotientAI for evaluation for the chosen metrics. After the evaluation is complete, the function displays aggregated statistics on the evaluation metrics followed by the summarized evaluation results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_dataset_qdrant_questions.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function evaluates the performance of a complete RAG pipeline on a given evaluation dataset.

    Given an evaluation dataset (containing questions and ground truth answers),
    this function retrieves relevant documents, populates the context field, and submits the dataset to QuotientAI for evaluation.
    Once the evaluation is complete, aggregated statistics on the evaluation metrics are displayed.

    The evaluation results are returned as a pandas dataframe.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Add context to each question by retrieving relevant documents
&lt;/span&gt;    &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;get_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                                                &lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Now we'll save the eval_df to a CSV
&lt;/span&gt;    &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Upload the eval dataset to QuotientAI
&lt;/span&gt;    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant-questions-eval-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a new task for the dataset
&lt;/span&gt;    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qdrant-questions-qa-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question_answering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run a job to evaluate the model
&lt;/span&gt;    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_fewshot_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metric_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Show the progress of the job
&lt;/span&gt;    &lt;span class="nf"&gt;show_job_progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Once the job is complete, we can get our results
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_eval_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Add the results to a pandas dataframe to get statistics on performance
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric|completion_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]]&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion_time_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Completion Time (ms)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Chunk Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selfcheckgpt_nli_relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context Relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selfcheckgpt_nli&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rougeL_fmeasure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ROUGE-L&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bert_score_f1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BERTScore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bert_sentence_similarity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BERT Sentence Similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion_verbosity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion Verbosity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;verbosity_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verbosity Ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,}&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()].&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="n"&gt;main_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Context Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Chunk Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROUGE-L&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BERT Sentence Similarity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BERTScore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
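
&lt;p&gt;For reference, a hypothetical invocation of &lt;code&gt;run_eval&lt;/code&gt; might look like this (the &lt;code&gt;recipe_mistral['id']&lt;/code&gt; argument assumes the recipe created in the next section; the accompanying notebook wires these pieces together):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical invocation - see the accompanying notebook for the exact calls.
results_df = run_eval(eval_df,
                      collection_name=COLLECTION_NAME,
                      recipe_id=recipe_mistral['id'],
                      num_docs=3,
                      path="eval_dataset_qdrant_questions_exp1.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;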



&lt;h2&gt;
  
  
  Experimentation
&lt;/h2&gt;

&lt;p&gt;Our approach is rooted in the belief that improvement thrives in an environment of exploration and discovery. By systematically testing and tweaking various components of the RAG pipeline, we aim to incrementally enhance its capabilities and performance.&lt;/p&gt;

&lt;p&gt;In the following section, we dive into the details of our experimentation process, outlining the specific experiments conducted and the insights gained.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 1 - Baseline
&lt;/h3&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;3&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll process our documents based on the configuration above and ingest them into Qdrant using the &lt;code&gt;add_documents&lt;/code&gt; method introduced earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#experiment1 - base config
&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;
&lt;span class="n"&gt;embedding_model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;num_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#processed: 4504
#content:   4504
#metadata:  4504
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;COLLECTION_NAME&lt;/code&gt;, which helps us segregate and identify our collections based on the experiments conducted.&lt;/p&gt;

&lt;p&gt;To proceed with the evaluation, let’s create the evaluation recipe next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a recipe for the generator model and prompt template
&lt;/span&gt;&lt;span class="n"&gt;recipe_mistral&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_recipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mistral-7b-instruct-qa-with-rag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mistral-7b-instruct using a prompt template that includes context.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;recipe_mistral&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs recipe JSON with the used prompt template
#'prompt_template': {'id': 1,
#  'name': 'Default Question Answering Template',
#  'variables': '["input_text","context"]',
#  'created_at': '2023-12-21T22:01:54.632367',
#  'template_string': 'Question: {input_text}\\n\\nContext: {context}\\n\\nAnswer:',
#  'owner_profile_id': None}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get a list of your existing recipes, you can simply run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_recipes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the recipe uses the simplest possible prompt template: the &lt;code&gt;Question&lt;/code&gt; comes from the evaluation dataset, the &lt;code&gt;Context&lt;/code&gt; from the document chunks retrieved from Qdrant, and the &lt;code&gt;Answer&lt;/code&gt; is generated by the pipeline.&lt;/p&gt;
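
&lt;p&gt;For intuition, here is how that template renders once its variables are filled in; the question and context below are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: rendering the default QA template from the recipe above
template = 'Question: {input_text}\n\nContext: {context}\n\nAnswer:'
print(template.format(
    input_text='How do I create a collection in Qdrant?',
    context='...the top-k document chunks retrieved from Qdrant...',
))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;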

&lt;p&gt;To kick off the evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kick off an evaluation job
&lt;/span&gt;&lt;span class="n"&gt;experiment_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_mistral&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_mistral.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may take a few minutes, depending on the size of the evaluation dataset.&lt;/p&gt;

&lt;p&gt;We can look at the results from our first (baseline) experiment below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frow9lmss3alskdh2qepq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frow9lmss3alskdh2qepq.png" alt="experiment1_eval.png" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we have a pretty &lt;strong&gt;low average Chunk Relevance&lt;/strong&gt; and &lt;strong&gt;very large standard deviations for both Chunk Relevance and Context Relevance&lt;/strong&gt;.&lt;/p&gt;
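
&lt;p&gt;To reproduce these summary statistics, one option is to describe the metric columns of the DataFrame returned by &lt;code&gt;run_eval&lt;/code&gt; (a sketch, assuming the &lt;code&gt;experiment_1&lt;/code&gt; and &lt;code&gt;main_metrics&lt;/code&gt; variables from earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A sketch: mean and spread of the evaluation metrics for the baseline run
display(experiment_1[main_metrics].describe().loc[['mean', 'std']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;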

&lt;p&gt;Let's take a look at some of the lower-performing data points with &lt;strong&gt;poor Faithfulness&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.max_colwidth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;experiment_1&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content.input_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content.answer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content.documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Chunk Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Context Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9vm573by4ph4e8kc3bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9vm573by4ph4e8kc3bq.png" alt="experiment1_bad_examples.png" width="800" height="766"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In instances where the retrieved documents are &lt;strong&gt;irrelevant (where both Chunk Relevance and Context Relevance are low)&lt;/strong&gt;, the model also shows &lt;strong&gt;tendencies to hallucinate&lt;/strong&gt; and &lt;strong&gt;produce poor quality responses&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The quality of the retrieved text directly impacts the quality of the LLM-generated answer. Therefore, our focus will be on enhancing the RAG setup by &lt;strong&gt;adjusting the chunking parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 2 - Adjusting the chunking parameters
&lt;/h3&gt;

&lt;p&gt;Keeping all other parameters constant, we changed the &lt;code&gt;chunk size&lt;/code&gt; and &lt;code&gt;chunk overlap&lt;/code&gt; to see if we can improve our results.&lt;/p&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;1024&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;128&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;3&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will reprocess the data with the updated parameters above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## for iteration 2 - lets modify chunk configuration
## We will start with creating seperate collection to store vectors
&lt;/span&gt;
&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
&lt;span class="n"&gt;embedding_model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;num_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#processed: 2152
#content:   2152
#metadata:  2152
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Followed by running the evaluation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjiy3g9fc2xhfx32ehdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjiy3g9fc2xhfx32ehdl.png" alt="experiment2_eval.png" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and &lt;strong&gt;comparing it with the results from Experiment 1:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttjhfft3yvcpi5thnj65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttjhfft3yvcpi5thnj65.png" alt="graph_exp1_vs_exp2.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We observed slight enhancements in our LLM completion metrics (including BERT Sentence Similarity, BERTScore, ROUGE-L, and Knowledge F1) with the increase in &lt;em&gt;chunk size&lt;/em&gt;. However, it's noteworthy that there was a significant decrease in &lt;em&gt;Faithfulness&lt;/em&gt;, which is the primary metric we are aiming to optimize.&lt;/p&gt;

&lt;p&gt;Moreover, &lt;em&gt;Context Relevance&lt;/em&gt; demonstrated an increase, indicating that the RAG pipeline retrieved more relevant information required to address the query. Nonetheless, there was a considerable drop in &lt;em&gt;Chunk Relevance&lt;/em&gt;, implying that a smaller portion of the retrieved documents contained pertinent information for answering the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The correlation between the rise in Context Relevance and the decline in Chunk Relevance suggests that retrieving more documents using the smaller chunk size might yield improved results.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 3 - Increasing the number of documents retrieved (retrieval window)
&lt;/h3&gt;

&lt;p&gt;This time, we are using the same RAG setup as &lt;code&gt;Experiment 1&lt;/code&gt;, but increasing the number of retrieved documents from &lt;strong&gt;3&lt;/strong&gt; to &lt;strong&gt;5&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;5&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can use the collection from Experiment 1 and run the evaluation with the modified &lt;code&gt;num_docs&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#collection name from Experiment 1
&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;#running eval for experiment 3
&lt;/span&gt;&lt;span class="n"&gt;experiment_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_mistral&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_mistral.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observe the results below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzgbi7q5wg4p7wvbljvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzgbi7q5wg4p7wvbljvk.png" alt="experiment_3_eval.png" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing the results with Experiments 1 and 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rh1y1rv70jwdminiw4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rh1y1rv70jwdminiw4x.png" alt="graph_exp1_exp2_exp3.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As anticipated, employing the smaller chunk size while retrieving a larger number of documents resulted in achieving the highest levels of both &lt;em&gt;Context Relevance&lt;/em&gt; and &lt;em&gt;Chunk Relevance.&lt;/em&gt; Additionally, it yielded the &lt;strong&gt;best&lt;/strong&gt; (albeit marginal) &lt;em&gt;Faithfulness&lt;/em&gt; score, indicating a &lt;em&gt;reduced occurrence of inaccuracies or hallucinations&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It looks like we have a good handle on our chunking parameters, but it is worth testing another embedding model to see if we can get better results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 4 - Changing the embedding model
&lt;/h3&gt;

&lt;p&gt;Let us try using &lt;strong&gt;MiniLM&lt;/strong&gt; for this experiment.&lt;/p&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;5&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will have to create another collection for this experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#experiment-4
&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;
&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#processed: 4504
#content:   4504
#metadata:  4504
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We observe the evaluation results below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n3jaa7haux468lj7fta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n3jaa7haux468lj7fta.png" alt="experiment4_eval.png" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing these with our previous experiments:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9rrit3pjg4in0owhq84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9rrit3pjg4in0owhq84.png" alt="graph_exp1_exp2_exp3_exp4.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It appears that &lt;code&gt;bge-small&lt;/code&gt; was more proficient in capturing the semantic nuances of the Qdrant Documentation.&lt;/p&gt;

&lt;p&gt;Up to this point, our experimentation has focused solely on the &lt;em&gt;retrieval aspect&lt;/em&gt; of our RAG pipeline. Now, let's explore altering the &lt;em&gt;generation aspect&lt;/em&gt; or LLM while retaining the optimal parameters identified in Experiment 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 5 - Changing the LLM
&lt;/h3&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;5&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;GPT-3.5-turbo&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this, we can repurpose our collection from Experiment 3, while the evaluation uses a new recipe with the &lt;strong&gt;GPT-3.5-turbo&lt;/strong&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#collection name from Experiment 3
&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# We have to create a recipe using the same prompt template and GPT-3.5-turbo
&lt;/span&gt;&lt;span class="n"&gt;recipe_gpt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_recipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt3.5-qa-with-rag-recipe-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GPT-3.5 using a prompt template that includes context.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;recipe_gpt&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#{'id': 495,
# 'name': 'gpt3.5-qa-with-rag-recipe-v1',
# 'description': 'GPT-3.5 using a prompt template that includes context.',
# 'model_id': 5,
# 'prompt_template_id': 1,
# 'created_at': '2024-05-03T12:14:58.779585',
# 'owner_profile_id': 34,
# 'system_prompt_id': None,
# 'prompt_template': {'id': 1,
#  'name': 'Default Question Answering Template',
#  'variables': '["input_text","context"]',
#  'created_at': '2023-12-21T22:01:54.632367',
#  'template_string': 'Question: {input_text}\\n\\nContext: {context}\\n\\nAnswer:',
#  'owner_profile_id': None},
# 'model': {'id': 5,
#  'name': 'gpt-3.5-turbo',
#  'endpoint': 'https://api.openai.com/v1/chat/completions',
#  'revision': 'placeholder',
#  'created_at': '2024-02-06T17:01:21.408454',
#  'model_type': 'OpenAI',
#  'description': 'Returns a maximum of 4K output tokens.',
#  'owner_profile_id': None,
#  'external_model_config_id': None,
#  'instruction_template_cls': 'NoneType'}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the evaluations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;experiment_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_gpt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_gpt.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We observe:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka3rvdenj2d7tlnld1yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka3rvdenj2d7tlnld1yu.png" alt="experiment5_eval.png" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and comparing all five experiments below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh8bxpxfpvpo7xep18cu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh8bxpxfpvpo7xep18cu.png" alt="graph_exp1_exp2_exp3_exp4_exp5.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-3.5 surpassed Mistral-7B in all metrics&lt;/strong&gt;! Notably, Experiment 5 exhibited the &lt;strong&gt;lowest occurrence of hallucination&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Let’s take a look at our results from all five experiments:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3p3ddfuy7cgqdb8i4mu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3p3ddfuy7cgqdb8i4mu.png" alt="overall_eval_results.png" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We still have a long way to go in improving the retrieval performance of RAG, as indicated by our generally poor results thus far. It might be beneficial to &lt;strong&gt;explore alternative embedding models&lt;/strong&gt; or &lt;strong&gt;different retrieval strategies&lt;/strong&gt; to address this issue.&lt;/p&gt;

&lt;p&gt;The significant variations in &lt;em&gt;Context Relevance&lt;/em&gt; suggest that &lt;strong&gt;certain questions may necessitate retrieving more documents than others&lt;/strong&gt;. Therefore, investigating a &lt;strong&gt;dynamic retrieval strategy&lt;/strong&gt; could be worthwhile.&lt;/p&gt;
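
&lt;p&gt;One way to prototype such a dynamic window with Qdrant is to cap the number of candidates and filter them by a score threshold instead of using a fixed &lt;code&gt;num_docs&lt;/code&gt; (a sketch, not part of the experiments above; it assumes the chunks were stored under a &lt;code&gt;content&lt;/code&gt; payload key, and the cutoff value is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a dynamic retrieval window: keep only chunks whose similarity
# clears a threshold, up to an upper bound, instead of a fixed num_docs
hits = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,   # the embedded user question
    limit=10,                    # upper bound on the retrieval window
    score_threshold=0.5,         # illustrative cutoff; tune it on your data
)
context = [hit.payload['content'] for hit in hits]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;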

&lt;p&gt;Furthermore, the &lt;strong&gt;generative aspect&lt;/strong&gt; of RAG still needs exploration: modifying LLMs or prompts can substantially impact the overall quality of responses.&lt;/p&gt;
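
&lt;p&gt;Quotient keeps that iteration cheap: register a recipe that pairs the same model with a different prompt template and re-run the evaluation (the &lt;code&gt;prompt_template_id&lt;/code&gt; below is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: same GPT-3.5 model, a different (hypothetical) prompt template
recipe_alt_prompt = quotient.create_recipe(
    model_id=5,
    prompt_template_id=2,  # hypothetical ID; inspect your available templates first
    name='gpt3.5-qa-with-rag-alt-prompt',
    description='GPT-3.5 with an alternative RAG prompt template.',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;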

&lt;p&gt;This iterative process demonstrates how, starting from scratch, continual evaluation and adjustments throughout experimentation can lead to the development of an enhanced RAG system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch this workshop on YouTube
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A workshop version of this article is &lt;a href="https://www.youtube.com/watch?v=3MEMPZR1aZA"&gt;available on YouTube&lt;/a&gt;. Follow along using our &lt;a href="https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-quotient"&gt;GitHub notebook&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>rag</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Blog-Reading Chatbot with GPT-4o</title>
      <dc:creator>David Myriel</dc:creator>
      <pubDate>Tue, 14 May 2024 16:31:49 +0000</pubDate>
      <link>https://dev.to/qdrant/blog-reading-chatbot-with-gpt-4o-2kha</link>
      <guid>https://dev.to/qdrant/blog-reading-chatbot-with-gpt-4o-2kha</guid>
      <description>&lt;p&gt;In this tutorial, you will build a RAG system that combines blog content ingestion with the capabilities of semantic search. &lt;strong&gt;OpenAI’s GPT-4o LLM&lt;/strong&gt; is powerful, but scaling its use requires us to supply context systematically.&lt;/p&gt;

&lt;p&gt;RAG enhances the LLM’s generation of answers by retrieving relevant documents to aid the question-answering process. This setup showcases the integration of advanced search and AI language processing to improve information retrieval and generation tasks.&lt;/p&gt;

&lt;p&gt;A notebook for this tutorial is available on &lt;a href="https://github.com/qdrant/examples/blob/master/langchain-lcel-rag/Langchain-LCEL-RAG-Demo.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Privacy and Sovereignty:&lt;/strong&gt; RAG applications often rely on sensitive or proprietary internal data. Running the entire stack within your own environment becomes crucial for maintaining control over this data. Qdrant Hybrid Cloud deployed on &lt;a href="https://www.scaleway.com/en/" rel="noopener noreferrer"&gt;Scaleway&lt;/a&gt; addresses this need perfectly, offering a secure, scalable platform that still leverages the full potential of RAG. Scaleway offers serverless &lt;a href="https://www.scaleway.com/en/serverless-functions/" rel="noopener noreferrer"&gt;Functions&lt;/a&gt; and serverless &lt;a href="https://www.scaleway.com/en/serverless-jobs/" rel="noopener noreferrer"&gt;Jobs&lt;/a&gt;, both of which are ideal for embedding creation in large-scale RAG cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Components
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Host:&lt;/strong&gt; Scaleway on managed Kubernetes for compatibility with Qdrant Hybrid Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Database:&lt;/strong&gt; Qdrant Hybrid Cloud as the vector search engine for retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4o, developed by OpenAI is utilized as the generator for producing answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework:&lt;/strong&gt; &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; for extensive RAG capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7hhu9s71ye9gabatk6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7hhu9s71ye9gabatk6f.png" alt="Architecture diagram" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; Langchain &lt;a href="https://python.langchain.com/v0.1/docs/integrations/chat/" rel="noopener noreferrer"&gt;supports a wide range of LLMs&lt;/a&gt;, and GPT-4o is used as the main generator in this tutorial. You can easily swap it out for your preferred model that might be launched on your premises to complete the fully private setup. For the sake of simplicity, we used the OpenAI APIs, but LangChain makes the transition seamless.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying Qdrant Hybrid Cloud on Scaleway
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.scaleway.com/en/kubernetes-kapsule/" rel="noopener noreferrer"&gt;Scaleway Kapsule&lt;/a&gt; and &lt;a href="https://www.scaleway.com/en/kubernetes-kosmos/" rel="noopener noreferrer"&gt;Kosmos&lt;/a&gt; are managed Kubernetes services from Scaleway. They abstract away the complexities of managing and operating a Kubernetes cluster. &lt;/p&gt;

&lt;p&gt;The primary difference is that Kapsule clusters are composed solely of Scaleway Instances, whereas a Kosmos cluster is a managed multi-cloud Kubernetes engine that allows you to connect instances from any cloud provider to a single managed control plane.&lt;/p&gt;

&lt;p&gt;To start using managed Kubernetes on Scaleway, follow the &lt;a href="https://qdrant.tech/documentation/hybrid-cloud/platform-deployment-options/#scaleway" rel="noopener noreferrer"&gt;platform-specific documentation.&lt;/a&gt;&lt;br&gt;
Once your Kubernetes clusters are up, &lt;a href="https://qdrant.tech/documentation/hybrid-cloud/" rel="noopener noreferrer"&gt;you can begin deploying Qdrant Hybrid Cloud.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To prepare the environment for working with Qdrant and related libraries, it’s necessary to install all required Python packages. This can be done using Poetry, a tool for dependency management and packaging in Python. The code snippet imports various libraries essential for the tasks ahead, including bs4 for parsing HTML and XML documents, langchain and its community extensions for working with language models and document loaders, and Qdrant for vector storage and retrieval. These imports lay the groundwork for utilizing Qdrant alongside other tools for natural language processing and machine learning tasks.&lt;/p&gt;

&lt;p&gt;Qdrant will be running on a specific URL and access will be restricted by the API key. Make sure to store them both as environment variables as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export QDRANT_URL="https://qdrant.example.com"
export QDRANT_API_KEY="your-api-key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optional: Whenever you use LangChain, you can also &lt;a href="https://docs.smith.langchain.com/" rel="noopener noreferrer"&gt;configure LangSmith&lt;/a&gt;, which will help us trace, monitor and debug LangChain applications. You can sign up for LangSmith &lt;a href="https://smith.langchain.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-api-key"
export LANGCHAIN_PROJECT="your-project"  # if not specified, defaults to "default"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Now you can get started:

import getpass
import os

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_qdrant import Qdrant
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up the OpenAI API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;os.environ["OPENAI_API_KEY"] = getpass.getpass()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize the language model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
llm = ChatOpenAI(model="gpt-4o")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where we configure both the embeddings and the LLM. You can replace these with your own models using Ollama or other services; Scaleway’s L4 GPU Instances are a good fit for that compute.&lt;/p&gt;
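
&lt;p&gt;For example, here is a minimal sketch of swapping in self-hosted models served through Ollama; the model names are illustrative, and the imports assume the &lt;code&gt;langchain-community&lt;/code&gt; package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: a self-hosted alternative to the OpenAI models used in this tutorial
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

llm = ChatOllama(model="llama3", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;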

&lt;h2&gt;
  
  
  Download and parse data
&lt;/h2&gt;

&lt;p&gt;To begin working with blog post contents, we first load and parse the HTML. This is done with LangChain’s &lt;code&gt;WebBaseLoader&lt;/code&gt;, which relies on &lt;code&gt;urllib&lt;/code&gt; and &lt;code&gt;BeautifulSoup&lt;/code&gt; under the hood. After the content is loaded and parsed, it is indexed with Qdrant. The code snippet below loads the contents of a blog post by specifying its URL and the specific HTML elements to parse, preparing the data for chunking and indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Chunking data
&lt;/h3&gt;

&lt;p&gt;When dealing with large documents, such as a blog post exceeding 42,000 characters, it’s crucial to manage the data efficiently for processing. Many models have a limited context window and struggle with long inputs, making it difficult to extract or find relevant information. To overcome this, the document is divided into smaller chunks. This approach enhances the model’s ability to process and retrieve the most pertinent sections of the document effectively.&lt;/p&gt;

&lt;p&gt;In this scenario, the document is split into chunks using the &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; with a specified chunk size and overlap. This method ensures that no critical information is lost between chunks. Following the splitting, these chunks are then indexed into Qdrant—a vector database for efficient similarity search and storage of embeddings. The &lt;code&gt;Qdrant.from_documents&lt;/code&gt; function is utilized for indexing, with documents being the split chunks and embeddings generated through &lt;code&gt;OpenAIEmbeddings&lt;/code&gt;. The entire process is facilitated within an in-memory database, signifying that the operations are performed without the need for persistent storage, and the collection is named “lilianweng” for reference.&lt;/p&gt;

&lt;p&gt;This chunking and indexing strategy significantly improves the management and retrieval of information from large documents, making it a practical solution for handling extensive texts in data processing workflows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Qdrant.from_documents(
    documents=splits, embedding=OpenAIEmbeddings(), location=":memory:", collection_name="lilianweng"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Retrieve and generate content
&lt;/h2&gt;

&lt;p&gt;The vectorstore is used as a retriever to fetch relevant documents based on vector similarity. The &lt;code&gt;hub.pull("rlm/rag-prompt")&lt;/code&gt; call pulls a specific prompt from the LangChain Hub, designed to combine retrieved documents and a question into a response.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;format_docs&lt;/code&gt; function formats the retrieved documents into a single string, preparing them for further processing. This formatted string, along with a question, is passed through a chain of operations. First, the context (formatted documents) and the question are processed by the retriever and the prompt. Then, the result is fed into the large language model (&lt;code&gt;llm&lt;/code&gt;) for content generation. Finally, the output is parsed into a string format using &lt;code&gt;StrOutputParser()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This chain of operations demonstrates a sophisticated approach to information retrieval and content generation, leveraging both the semantic understanding capabilities of vector search and the generative prowess of large language models.&lt;/p&gt;

&lt;p&gt;Now, retrieve and generate data using relevant snippets from the blog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Invoking the RAG Chain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain.invoke("What is Task Decomposition?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
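
&lt;p&gt;Since the chain is a LangChain runnable, you can also stream the answer token by token instead of waiting for the full response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stream the generated answer chunk by chunk
for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;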



&lt;h2&gt;
  
  
  Next steps:
&lt;/h2&gt;

&lt;p&gt;We built a solid foundation for a simple chatbot, but there is still a lot to do. If you want to make the system production-ready, you should consider integrating the mechanism into your existing stack.&lt;/p&gt;

&lt;p&gt;Our vector database can easily be hosted on &lt;a href="https://www.scaleway.com/" rel="noopener noreferrer"&gt;Scaleway&lt;/a&gt;, our trusted &lt;a href="https://qdrant.tech/documentation/hybrid-cloud/" rel="noopener noreferrer"&gt;Qdrant Hybrid Cloud&lt;/a&gt; partner. This means that Qdrant can be run from your Scaleway region, but the database itself can still be managed from within Qdrant Cloud’s interface. Both products have been tested for compatibility and scalability, and we recommend their &lt;a href="https://www.scaleway.com/en/kubernetes-kapsule/" rel="noopener noreferrer"&gt;managed Kubernetes&lt;/a&gt; service. Their deployment regions in France are excellent for network latency and data sovereignty. For hosted GPUs, try &lt;a href="https://www.scaleway.com/en/l4-gpu-instance/" rel="noopener noreferrer"&gt;rendering with L4 GPU instances&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to ask in our Discord community.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is RAG (Retrieval-Augmented Generation)?</title>
      <dc:creator>Sabrina</dc:creator>
      <pubDate>Tue, 19 Mar 2024 16:47:57 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-rag-understanding-retrieval-augmented-generation-534n</link>
      <guid>https://dev.to/qdrant/what-is-rag-understanding-retrieval-augmented-generation-534n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Retrieval-augmented generation (RAG) integrates external information retrieval into the process of generating responses by Large Language Models (LLMs). It searches a database for information beyond its pre-trained knowledge base, significantly improving the accuracy and relevance of the generated responses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Language models have exploded on the internet ever since ChatGPT came out, and rightfully so. They can write essays, code entire programs, and even make memes (though we’re still deciding on whether that's a good thing).&lt;/p&gt;

&lt;p&gt;But as brilliant as these chatbots become, they still have &lt;strong&gt;limitations&lt;/strong&gt; in tasks requiring external knowledge and factual information. &lt;/p&gt;

&lt;p&gt;Yes, they can describe the honeybee's waggle dance in excruciating detail. But they become far more valuable if they can generate insights from &lt;strong&gt;any data&lt;/strong&gt; that we provide, rather than just their &lt;em&gt;original training data.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Since retraining those large language models from scratch costs millions of dollars and takes months, we need better ways to give our existing LLMs access to our custom data.&lt;/p&gt;

&lt;p&gt;While you could be more creative with your prompts, it is only a short-term solution. LLMs can consider only a &lt;strong&gt;limited&lt;/strong&gt; amount of text in their responses, known as a &lt;a href="https://www.hopsworks.ai/dictionary/context-window-for-llms"&gt;context window&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Some models like GPT-3 can see up to around 12 pages of text (that’s 4,096 tokens of context). That’s not good enough for most knowledge bases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo4w8jqvyqxduzq9fwi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo4w8jqvyqxduzq9fwi5.png" alt="How a RAG works" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows how a basic RAG system works. Before forwarding the question to the LLM, we have a layer that searches our knowledge base for the "relevant knowledge" to answer the user query. Specifically, in this case, the spending data from the last month. &lt;/p&gt;

&lt;p&gt;Our LLM can now generate a &lt;strong&gt;relevant non-hallucinated&lt;/strong&gt; response about our budget. &lt;/p&gt;

&lt;p&gt;As your data grows, you’ll need efficient ways to identify the most relevant information for your LLM's limited memory. This is where you’ll want a proper way to store and retrieve the specific data you’ll need for your query, without needing the LLM to remember it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector databases&lt;/strong&gt; store information as &lt;strong&gt;vector embeddings&lt;/strong&gt;. This format supports efficient similarity searches to retrieve relevant data for your query. For example, Qdrant is specifically designed to perform fast, even in scenarios dealing with billions of vectors.&lt;/p&gt;

&lt;p&gt;This article will focus on RAG systems and architecture. If you’re interested in learning more about vector search, we recommend the following articles: &lt;a href="https://qdrant.tech/articles/what-is-a-vector-database/"&gt;What is a Vector Database?&lt;/a&gt; and &lt;a href="https://qdrant.tech/articles/what-are-embeddings/"&gt;What are Vector Embeddings?&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG architecture
&lt;/h2&gt;

&lt;p&gt;At its core, a RAG architecture includes the &lt;strong&gt;retriever&lt;/strong&gt; and the &lt;strong&gt;generator&lt;/strong&gt;. Let's start by understanding what each of these components does.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Retriever
&lt;/h3&gt;

&lt;p&gt;When you ask the retriever a question, it uses &lt;strong&gt;similarity search&lt;/strong&gt; to scan through a vast knowledge base of vector embeddings. It then pulls out the most &lt;strong&gt;relevant&lt;/strong&gt; vectors to help answer that query. There are a few different techniques it can use to decide what’s relevant:&lt;/p&gt;

&lt;h3&gt;
  
  
  How indexing works in RAG retrievers
&lt;/h3&gt;

&lt;p&gt;The indexing process organizes the data into your vector database in a way that makes it easily searchable. This allows the RAG to access relevant information when responding to a query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w6dvfzo4s4dep9t8ryc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w6dvfzo4s4dep9t8ryc.png" alt="How indexing works" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the image above, here’s the process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a &lt;em&gt;loader&lt;/em&gt; that gathers &lt;em&gt;documents&lt;/em&gt; containing your data. These documents could be anything from articles and books to web pages and social media posts. &lt;/li&gt;
&lt;li&gt;Next, a &lt;em&gt;splitter&lt;/em&gt; divides the documents into smaller chunks, typically sentences or paragraphs, because RAG models work better with smaller pieces of text. In the diagram, these are &lt;em&gt;document snippets&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Each text chunk is then fed into an &lt;em&gt;embedding machine&lt;/em&gt;. This machine uses complex algorithms to convert the text into &lt;a href="https://qdrant.tech/articles/what-are-embeddings/"&gt;vector embeddings&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the generated vector embeddings are stored in a knowledge base of indexed information. This supports efficient retrieval of similar pieces of information when needed.&lt;/p&gt;
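
&lt;p&gt;To make this pipeline concrete, here is a minimal sketch in Python, assuming the &lt;code&gt;qdrant-client&lt;/code&gt; and &lt;code&gt;sentence-transformers&lt;/code&gt; packages; the collection name, model choice, and sample documents are illustrative, not prescriptive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# Illustrative documents; a real loader would pull these from files or a database.
documents = [
    "Qdrant is a vector database written in Rust.",
    "RAG combines retrieval with text generation.",
]

# The splitter step is elided here; assume each document is already a small chunk.
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings
vectors = model.encode(documents)

client = QdrantClient(":memory:")  # in-memory instance, handy for experimentation
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)
client.upsert(
    collection_name="knowledge_base",
    points=[
        models.PointStruct(id=i, vector=vec.tolist(), payload={"text": doc})
        for i, (doc, vec) in enumerate(zip(documents, vectors), start=1)
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;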

&lt;h3&gt;
  
  
  Query vectorization
&lt;/h3&gt;

&lt;p&gt;Once you have vectorized your knowledge base, you can do the same with the user query. When the model sees a new query, it uses the same preprocessing and embedding techniques. This ensures that the query vector is compatible with the document vectors in the index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgqcvfupjx6pvtc9nttz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgqcvfupjx6pvtc9nttz.png" alt="How retrieval works" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;
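
&lt;p&gt;Continuing the sketch above, the user query is embedded with the same model and searched against the collection (a hedged example; the &lt;code&gt;search&lt;/code&gt; call reflects one common qdrant-client pattern):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;query = "What does RAG combine?"
query_vector = model.encode(query)  # same embedding model as the documents

hits = client.search(
    collection_name="knowledge_base",
    query_vector=query_vector.tolist(),
    limit=2,  # number of nearest neighbors to return
)
for hit in hits:
    print(hit.score, hit.payload["text"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;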

&lt;h3&gt;
  
  
  Retrieval of relevant documents
&lt;/h3&gt;

&lt;p&gt;When the system needs to find the most relevant documents or passages to answer a query, it utilizes vector similarity techniques. &lt;strong&gt;Vector similarity&lt;/strong&gt; is a fundamental concept in machine learning and natural language processing (NLP) that quantifies the resemblance between vectors, which are mathematical representations of data points.&lt;/p&gt;

&lt;p&gt;The system can employ different vector similarity strategies depending on the type of vectors used to represent the data:&lt;/p&gt;

&lt;h4&gt;
  
  
  Sparse vector representations
&lt;/h4&gt;

&lt;p&gt;A sparse vector is characterized by a high dimensionality, with most of its elements being zero.&lt;/p&gt;

&lt;p&gt;The classic approach is &lt;strong&gt;keyword search&lt;/strong&gt;, which scans documents for the exact words or phrases in the query. The search creates sparse vector representations of documents by counting word occurrences and inversely weighting common words. Queries with rarer words get prioritized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez1mthlieogdyfv6lvlc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez1mthlieogdyfv6lvlc.png" alt="Sparse Vectors explanation" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf"&gt;TF-IDF&lt;/a&gt; (Term Frequency-Inverse Document Frequency) and &lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25"&gt;BM25&lt;/a&gt; are two classic related algorithms. They're simple and computationally efficient. However, they can struggle with synonyms and don't always capture semantic similarities.&lt;/p&gt;

&lt;p&gt;If you’re interested in going deeper, refer to our article on &lt;a href="https://qdrant.tech/articles/sparse-vectors/"&gt;Sparse Vectors&lt;/a&gt;.&lt;/p&gt;
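
&lt;p&gt;As a rough illustration of what a sparse representation looks like, here is a small sketch using scikit-learn's &lt;code&gt;TfidfVectorizer&lt;/code&gt; (an assumed dependency, used only for demonstration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "vector search with Qdrant",
    "keyword search ranks documents by rare terms",
    "sparse vectors are mostly zeros",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # one sparse row per document

# Most entries are zero: only the words present in a document get a weight,
# and rarer words receive higher (inverse document frequency) weights.
print(matrix.shape)   # (3, vocabulary_size)
print(matrix[0].nnz)  # number of non-zero entries in the first row
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;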

&lt;h4&gt;
  
  
  Dense vector embeddings
&lt;/h4&gt;

&lt;p&gt;This approach uses large language models like &lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)"&gt;BERT&lt;/a&gt; to encode the query and passages into dense vector embeddings. These embeddings are compact numerical representations that capture semantic meaning. Vector databases like Qdrant store these embeddings and retrieve them based on &lt;strong&gt;semantic similarity&lt;/strong&gt;, using distance metrics like cosine similarity rather than just keywords.&lt;/p&gt;

&lt;p&gt;This allows the retriever to match based on semantic understanding rather than just keywords. So if I ask about "compounds that cause BO," it can retrieve relevant info about "molecules that create body odor" even if those exact words weren't used. We explain more about it in our &lt;a href="https://qdrant.tech/articles/what-are-embeddings/"&gt;What are Vector Embeddings&lt;/a&gt; article.&lt;/p&gt;
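
&lt;p&gt;The body-odor example can be reproduced in a few lines; the encoder below is just one commonly used sentence-embedding model, assumed for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = model.encode("compounds that cause BO")
passage = model.encode("molecules that create body odor")

# Cosine similarity is high despite zero keyword overlap,
# because the two phrases are semantically close.
print(util.cos_sim(query, passage))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;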

&lt;h3&gt;
  
  
  Hybrid search
&lt;/h3&gt;

&lt;p&gt;However, neither keyword search nor vector search is always perfect. Keyword search may miss relevant information expressed differently, while vector search can sometimes struggle with specificity or neglect important statistical word patterns. Hybrid methods aim to combine the strengths of different techniques.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws5e60ew6dawaobvf3v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws5e60ew6dawaobvf3v6.png" alt="Hybrid search overview" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some common hybrid approaches include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using keyword search to get an initial set of candidate documents. Next, the documents are re-ranked/re-scored using semantic vector representations.&lt;/li&gt;
&lt;li&gt;Starting with semantic vectors to find generally topically relevant documents. Next, the documents are filtered/re-ranked based on keyword matches or other metadata.&lt;/li&gt;
&lt;li&gt;Considering both semantic vector closeness and statistical keyword patterns/weights in a combined scoring model.&lt;/li&gt;
&lt;li&gt;Using multiple stages where different techniques are applied. One example: start with an initial keyword retrieval, followed by semantic re-ranking, then a final re-ranking using even more complex models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you combine the powers of different search methods in a complementary way, you can provide higher quality, more comprehensive results. Check out our article on &lt;a href="https://qdrant.tech/articles/hybrid-search/"&gt;Hybrid Search&lt;/a&gt; if you’d like to learn more.&lt;/p&gt;
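
&lt;p&gt;As a sketch of the first pattern above (keyword candidates re-ranked semantically), something like the following would work; the &lt;code&gt;rank-bm25&lt;/code&gt; package and the model name are assumptions chosen for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Qdrant supports hybrid search with sparse and dense vectors.",
    "BM25 ranks documents using term frequency statistics.",
    "Dense embeddings capture semantic similarity between texts.",
]
query = "combining keyword and semantic retrieval"

# Stage 1: keyword retrieval with BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: re-rank the keyword candidates by dense cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode(query)
reranked = sorted(
    candidates,
    key=lambda i: float(util.cos_sim(query_vec, model.encode(corpus[i]))),
    reverse=True,
)
print([corpus[i] for i in reranked])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;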

&lt;h2&gt;
  
  
  The Generator
&lt;/h2&gt;

&lt;p&gt;With the top relevant passages retrieved, it's now the generator's job to produce a final answer by synthesizing and expressing that information in natural language. &lt;/p&gt;

&lt;p&gt;The LLM is typically a model like GPT, BART or T5, trained on massive datasets to understand and generate human-like text. It now takes not only the query (or question) as input but also the relevant documents or passages that the retriever identified as potentially containing the answer to generate its response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlq9rw4mw26gvf4qxl9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlq9rw4mw26gvf4qxl9r.png" alt="How a Generator works" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;
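
&lt;p&gt;In code, this step often amounts to packing the retrieved passages into the prompt. Below is a hedged sketch assuming the OpenAI Python client; any LLM client would work the same way, and the model name is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

retrieved_passages = [
    "You spent $1,420 on groceries last month.",
    "Your total spending in March was $3,975.",
]
question = "How much did I spend on groceries?"

# The generator sees both the question and the retrieved context.
prompt = "Answer using only the context below.\n\nContext:\n"
prompt += "\n".join(retrieved_passages)
prompt += f"\n\nQuestion: {question}"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;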

&lt;p&gt;The retriever and generator don't operate in isolation. The image below shows how the output of the retrieval feeds the generator to produce the final generated response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckvyszdo4cyklt09ts38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckvyszdo4cyklt09ts38.png" alt="The entire architecture of a RAG system" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is RAG being used?
&lt;/h2&gt;

&lt;p&gt;Because of their more knowledgeable and contextual responses, we can find RAG models being applied in many areas today, especially those that need factual accuracy and knowledge depth. &lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Applications:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question answering:&lt;/strong&gt; This is perhaps the most prominent use case for RAG models. They power advanced question-answering systems that can retrieve relevant information from large knowledge bases and then generate fluent answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language generation:&lt;/strong&gt; RAG enables more factual and contextualized text generation, such as summarization that draws on multiple sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-to-text generation:&lt;/strong&gt; By retrieving relevant structured data, RAG models can generate product or business intelligence reports from databases, or describe insights from data visualizations and charts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimedia understanding:&lt;/strong&gt; RAG isn't limited to text: it can retrieve multimodal information like images, video, and audio to enhance understanding, for example answering questions about images or videos by retrieving relevant textual context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating your first RAG chatbot with Langchain, Groq, and OpenAI
&lt;/h2&gt;

&lt;p&gt;Are you ready to create your own RAG chatbot from the ground up? We have a video explaining everything from the beginning. Daniel Romero will guide you through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up your chatbot&lt;/li&gt;
&lt;li&gt;Preprocessing and organizing data for your chatbot's use&lt;/li&gt;
&lt;li&gt;Applying vector similarity search algorithms&lt;/li&gt;
&lt;li&gt;Enhancing the efficiency and response quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After building your RAG chatbot, you'll be able to evaluate its performance against that of a chatbot powered solely by a Large Language Model (LLM).&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/O60-KuZZeQA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;Have a RAG project you want to bring to life? Join our &lt;a href="//discord.gg/qdrant"&gt;Discord community&lt;/a&gt; where we’re always sharing tips and answering questions on vector search and retrieval.&lt;/p&gt;

&lt;p&gt;Learn more about how to properly evaluate your RAG responses: &lt;a href="https://superlinked.com/vectorhub/evaluating-retrieval-augmented-generation-a-framework-for-assessment"&gt;Evaluating Retrieval Augmented Generation - a framework for assessment&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Qdrant 1.8.0 - Major Performance Enhancements</title>
      <dc:creator>David Myriel</dc:creator>
      <pubDate>Fri, 08 Mar 2024 16:41:12 +0000</pubDate>
      <link>https://dev.to/qdrant/qdrant-180-major-performance-enhancements-384o</link>
      <guid>https://dev.to/qdrant/qdrant-180-major-performance-enhancements-384o</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/qdrant/qdrant/releases/tag/v1.8.0" rel="noopener noreferrer"&gt;Qdrant 1.8.0 is out!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time around, we have focused on Qdrant's internals. Our goal was to optimize performance, so that your existing setup can run faster and save on compute. Here is what we've been up to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster sparse vectors:&lt;/strong&gt; Hybrid search is up to 16x faster now!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU resource management:&lt;/strong&gt; You can allocate CPU threads for faster indexing. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better indexing performance:&lt;/strong&gt; We optimized text indexing on the backend.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Faster search with sparse vectors
&lt;/h2&gt;

&lt;p&gt;Search throughput is now up to 16 times faster for sparse vectors. If you are &lt;a href="https://dev.to/articles/sparse-vectors/"&gt;using Qdrant for hybrid search&lt;/a&gt;, this means that you can now handle up to sixteen times as many queries. This improvement comes from extensive backend optimizations aimed at increasing efficiency and capacity. &lt;/p&gt;

&lt;p&gt;What this means for your setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query speed:&lt;/strong&gt; The time it takes to run a search query has been significantly reduced. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search capacity:&lt;/strong&gt; Qdrant can now handle a much larger volume of search requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User experience:&lt;/strong&gt; Results will appear faster, leading to a smoother experience for the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; You can easily accommodate rapidly growing user numbers or an expanding dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sparse vectors benchmark
&lt;/h3&gt;

&lt;p&gt;Performance results are publicly available for you to test. Qdrant's R&amp;amp;D developed a dedicated &lt;a href="https://github.com/qdrant/sparse-vectors-benchmark" rel="noopener noreferrer"&gt;open-source benchmarking tool&lt;/a&gt; just to test sparse vector performance.&lt;/p&gt;

&lt;p&gt;A real-life simulation of sparse vector queries was run against the &lt;a href="https://big-ann-benchmarks.com/neurips23.html" rel="noopener noreferrer"&gt;NeurIPS 2023 dataset&lt;/a&gt;. All tests were done on an 8 CPU machine on Azure. &lt;/p&gt;

&lt;p&gt;Latency (y-axis) has dropped significantly for queries. You can see the before/after here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kgj248oqlxs4i4wvvmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kgj248oqlxs4i4wvvmy.png" alt="dropping latency" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Dropping latency in sparse vector search queries across versions 1.7-1.8.&lt;/p&gt;

&lt;p&gt;The colors within both scatter plots show the frequency of results. The red dots show that the highest concentration is around 2200ms (before) and 135ms (after). This tells us that latency for sparse vector queries dropped by about a factor of 16. Therefore, the time it takes to retrieve an answer with Qdrant is that much shorter. &lt;/p&gt;

&lt;p&gt;This performance increase can have a dramatic effect on hybrid search implementations. &lt;a href="https://dev.to/articles/sparse-vectors/"&gt;Read more about how to set this up.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FYI, sparse vectors were released in &lt;a href="///articles/qdrant-1.7.x/#sparse-vectors"&gt;Qdrant v.1.7.0&lt;/a&gt;. They are stored using a different index, so first &lt;a href="https://dev.to/documentation/concepts/indexing/#sparse-vector-index"&gt;check out the documentation&lt;/a&gt; if you want to try an implementation.&lt;/p&gt;
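
&lt;p&gt;For a sense of what this looks like from the Python client, here is a hedged sketch of creating a collection with a sparse vector index and querying it; the exact call signatures may differ between client versions, so treat this as illustrative and check the documentation linked above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="hybrid_demo",
    vectors_config={},  # no dense vectors in this minimal example
    sparse_vectors_config={"text": models.SparseVectorParams()},
)

# A sparse vector stores only its non-zero dimensions: indices and values.
client.upsert(
    collection_name="hybrid_demo",
    points=[
        models.PointStruct(
            id=1,
            vector={"text": models.SparseVector(indices=[3, 17, 42], values=[0.6, 1.2, 0.3])},
            payload={"doc": "example document"},
        )
    ],
)

hits = client.search(
    collection_name="hybrid_demo",
    query_vector=models.NamedSparseVector(
        name="text",
        vector=models.SparseVector(indices=[17, 42], values=[1.0, 0.5]),
    ),
    limit=1,
)
print(hits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;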

&lt;h2&gt;
  
  
  CPU resource management
&lt;/h2&gt;

&lt;p&gt;Indexing is Qdrant’s most resource-intensive process. Now you can account for this by allocating compute specifically to indexing. You can assign a number of CPU threads to indexing and leave the rest for search. As a result, indexes will build faster, and search quality will remain unaffected.&lt;/p&gt;

&lt;p&gt;This isn't mandatory, as Qdrant is by default tuned to strike the right balance between indexing and search. However, if you wish to define specific CPU usage, you will need to do so from &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This version introduces an &lt;code&gt;optimizer_cpu_budget&lt;/code&gt; parameter to control the maximum number of CPUs used for indexing. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read more about &lt;code&gt;config.yaml&lt;/code&gt; in the &lt;a href="https://dev.to/documentation/guides/configuration/"&gt;configuration file&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CPU budget, how many CPUs (threads) to allocate for an optimization job.&lt;/span&gt;
&lt;span class="na"&gt;optimizer_cpu_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If left at 0, Qdrant will keep 1 or more CPUs unallocated - depending on CPU size.&lt;/li&gt;
&lt;li&gt;If the setting is positive, Qdrant will use this exact number of CPUs for indexing.&lt;/li&gt;
&lt;li&gt;If the setting is negative, Qdrant will subtract this number of CPUs from the available CPUs for indexing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most users, the default &lt;code&gt;optimizer_cpu_budget&lt;/code&gt; setting will work well. We only recommend you use this if your indexing load is significant.&lt;/p&gt;

&lt;p&gt;Our backend leverages dynamic CPU saturation to increase indexing speed. For that reason, the impact on search query performance ends up being minimal. Ultimately, you will be able to strike the best possible balance between indexing times and search performance.&lt;/p&gt;

&lt;p&gt;This configuration can be done at any time, but it requires a restart of Qdrant. Changing it affects both existing and new collections. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This feature is not configurable on &lt;a href="https://qdrant.to/cloud" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Better indexing for text data
&lt;/h2&gt;

&lt;p&gt;In order to minimize your RAM expenditure, we have developed a new way to index specific types of data. Please keep in mind that this is a backend improvement, and you won't need to configure anything. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Going forward, if you are indexing immutable text fields, we estimate a 10% reduction in RAM loads. Our benchmark result is based on a system that uses 64GB of RAM. If you are using less RAM, this reduction might be higher than 10%.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Immutable text fields are static and do not change once they are added to Qdrant. These entries usually represent some type of attribute, description, or tag. Vectors associated with them can be indexed more efficiently, since you don’t need to re-index them anymore. Conversely, mutable fields are dynamic and can be modified after their initial creation. Please keep in mind that they will continue to require additional RAM.&lt;/p&gt;

&lt;p&gt;This approach ensures stability in the vector search index, with faster and more consistent operations. We achieved this by setting up a field index which helps minimize what is stored. To improve search performance we have also optimized the way we load documents for searches with a text field index. Now our backend loads documents mostly sequentially and in increasing order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minor improvements and new features
&lt;/h2&gt;

&lt;p&gt;Beyond these enhancements, &lt;a href="https://github.com/qdrant/qdrant/releases/tag/v1.8.0" rel="noopener noreferrer"&gt;Qdrant v1.8.0&lt;/a&gt; adds and improves on several smaller features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Order points by payload:&lt;/strong&gt; In addition to searching for semantic results, you might want to retrieve results by specific metadata (such as price). You can now use Scroll API to &lt;a href="https://dev.to/documentation/concepts/points/#order-points-by-payload-key"&gt;order points by payload key&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datetime support:&lt;/strong&gt; We have implemented &lt;a href="https://dev.to/documentation/concepts/filtering/#datetime-range"&gt;datetime support for the payload index&lt;/a&gt;. Prior to this, if you wanted to search for a specific datetime range, you would have had to convert dates to UNIX timestamps. (&lt;a href="https://github.com/qdrant/qdrant/issues/3320" rel="noopener noreferrer"&gt;PR#3320&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check collection existence:&lt;/strong&gt; You can check whether a collection exists by appending &lt;code&gt;/exists&lt;/code&gt; to the &lt;code&gt;/collections/{collection_name}&lt;/code&gt; endpoint. You will get a true/false response (&lt;a href="https://github.com/qdrant/qdrant/pull/3472" rel="noopener noreferrer"&gt;PR#3472&lt;/a&gt;; see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find points&lt;/strong&gt; whose payloads match more than a minimum number of conditions: we added the &lt;code&gt;min_should&lt;/code&gt; option, which sets how many of the listed conditions must match for a point to be returned (&lt;a href="https://github.com/qdrant/qdrant/pull/3466/" rel="noopener noreferrer"&gt;PR#3331&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modify nested fields:&lt;/strong&gt; We have improved the &lt;code&gt;set_payload&lt;/code&gt; API, adding the ability to update nested fields (&lt;a href="https://github.com/qdrant/qdrant/pull/3548" rel="noopener noreferrer"&gt;PR#3548&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
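
&lt;p&gt;Here is a hedged sketch of a few of these features from the Python client, assuming a qdrant-client release matching Qdrant 1.8; exact signatures may vary between versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

# Check whether a collection exists (true/false, backed by the /exists endpoint).
print(client.collection_exists(collection_name="products"))

# Order scrolled points by a payload key instead of by id.
points, _next_page = client.scroll(
    collection_name="products",
    order_by=models.OrderBy(key="price", direction=models.Direction.DESC),
    limit=10,
)

# Filter on a datetime range directly, without converting to UNIX timestamps.
points, _next_page = client.scroll(
    collection_name="products",
    scroll_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="added_at",
                range=models.DatetimeRange(
                    gte=datetime(2024, 1, 1, tzinfo=timezone.utc),
                    lte=datetime(2024, 3, 1, tzinfo=timezone.utc),
                ),
            )
        ]
    ),
    limit=10,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;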

&lt;h2&gt;
  
  
  Release notes
&lt;/h2&gt;

&lt;p&gt;For more information, see &lt;a href="https://github.com/qdrant/qdrant/releases/tag/v1.8.0" rel="noopener noreferrer"&gt;our release notes&lt;/a&gt;. &lt;br&gt;
Qdrant is an open source project. We welcome your contributions; raise &lt;a href="https://github.com/qdrant/qdrant/issues" rel="noopener noreferrer"&gt;issues&lt;/a&gt;, or contribute via &lt;a href="https://github.com/qdrant/qdrant/pulls" rel="noopener noreferrer"&gt;pull requests&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>devops</category>
      <category>news</category>
      <category>database</category>
    </item>
    <item>
      <title>RAG is Dead. Long Live RAG!</title>
      <dc:creator>David Myriel</dc:creator>
      <pubDate>Wed, 28 Feb 2024 21:44:14 +0000</pubDate>
      <link>https://dev.to/qdrant/rag-is-dead-long-live-rag-5k8</link>
      <guid>https://dev.to/qdrant/rag-is-dead-long-live-rag-5k8</guid>
      <description>&lt;p&gt;When Anthropic came out with a context window of 100K tokens, they said: “&lt;em&gt;Vector search is dead. LLMs are getting more accurate and won’t need RAG anymore.&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;Google’s Gemini 1.5 now offers a context window of 10 million tokens. &lt;a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf" rel="noopener noreferrer"&gt;Their supporting paper&lt;/a&gt; claims victory over accuracy issues, even when applying Greg Kamradt’s &lt;a href="https://twitter.com/GregKamradt/status/1722386725635580292" rel="noopener noreferrer"&gt;NIAH methodology&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;It’s over. RAG must be completely obsolete now. Right?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Larger context windows are never the solution. Let me repeat. Never. They require more computational resources and lead to slower processing times. &lt;/p&gt;

&lt;p&gt;The community is already stress testing Gemini 1.5: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftait4tne3qsw6yl3hzch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftait4tne3qsw6yl3hzch.png" alt="rag-is-dead-1.png" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not surprising. LLMs require massive amounts of compute and memory to run. To cite Grant, running such a model by itself “would deplete a small coal mine to generate each completion”. Also, who is waiting 30 seconds for a response?&lt;/p&gt;

&lt;h2&gt;
  
  
  Context stuffing is not the solution
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Relying on context is expensive, and it doesn’t improve response quality in real-world applications. Retrieval based on vector search offers much higher precision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you solely rely on an LLM to perfect retrieval and precision, you are doing it wrong. &lt;/p&gt;

&lt;p&gt;A large context window makes it harder to focus on relevant information. This increases the risk of errors or hallucinations in its responses. &lt;/p&gt;

&lt;p&gt;Google found Gemini 1.5 significantly more accurate than GPT-4 at shorter context lengths and “a very small decrease in recall towards 1M tokens”. The recall is still below 0.8.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefu8rcww1dvr3hakh4pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefu8rcww1dvr3hakh4pm.png" alt="rag-is-dead-2" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We don’t think 60-80% is good enough. The LLM might retrieve enough relevant facts in its context window, but it still loses up to 40% of the available information.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The whole point of vector search is to circumvent this process by efficiently picking the information your app needs to generate the best response. A vector database keeps the compute load low and the query response fast. You don’t need to wait for the LLM at all.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Qdrant’s benchmark results are strongly in favor of accuracy and efficiency. We recommend that you consider them before deciding that an LLM is enough. Take a look at our &lt;a href="https://qdrant.tech/benchmarks/" rel="noopener noreferrer"&gt;open-source benchmark reports&lt;/a&gt; and &lt;a href="https://github.com/qdrant/vector-db-benchmark" rel="noopener noreferrer"&gt;try out the tests&lt;/a&gt; yourself. &lt;/p&gt;

&lt;h2&gt;
  
  
  Vector search in compound systems
&lt;/h2&gt;

&lt;p&gt;The future of AI lies in careful system engineering. As per &lt;a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/" rel="noopener noreferrer"&gt;Zaharia et al.&lt;/a&gt;, results from Databricks find that “60% of LLM applications use some form of RAG, while 30% use multi-step chains.” &lt;/p&gt;

&lt;p&gt;Even Gemini 1.5 demonstrates the need for a complex strategy. When looking at &lt;a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf" rel="noopener noreferrer"&gt;Google’s MMLU Benchmark&lt;/a&gt;, the model was called 32 times to reach a score of 90.0% accuracy. This shows us that even a basic compound arrangement is superior to monolithic models. &lt;/p&gt;

&lt;p&gt;As a retrieval system, a vector database perfectly fits the need for compound systems. Introducing one into your design opens up possibilities for superior applications of LLMs: faster, more accurate, and much cheaper to run. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key advantage of RAG is that it allows an LLM to pull in real-time information from up-to-date internal and external knowledge sources, making it more dynamic and adaptable to new information. - Oliver Molander, CEO of IMAGINAI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Qdrant scales to enterprise RAG scenarios
&lt;/h2&gt;

&lt;p&gt;People still don’t understand the economic benefit of vector databases. Why would a large corporate AI system need a stand-alone vector db like Qdrant? In our minds, this is the most important question. Let’s pretend that LLMs cease struggling with context thresholds altogether. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much would all of this cost?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;If you are running a RAG solution in an enterprise environment with petabytes of private data, your compute bill will be unimaginable. Let's assume 1 cent per 1K input tokens (which is the current GPT-4 Turbo pricing). Whatever you are doing, every time you go 100 thousand tokens deep, it will cost you $1. &lt;/p&gt;

&lt;p&gt;That’s a buck a question. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to our estimations, vector search queries are &lt;strong&gt;at least&lt;/strong&gt; 100 million times cheaper than queries made by LLMs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Conversely, the only up-front investment with vector databases is the indexing (which requires more compute). After this step, everything else is a breeze. Once set up, Qdrant easily scales via &lt;a href="https://qdrant.tech/articles/multitenancy/" rel="noopener noreferrer"&gt;features like Multitenancy and Sharding&lt;/a&gt;. This lets you scale up your reliance on the vector retrieval process and minimize your use of the compute-heavy LLMs. As an optimization measure, Qdrant is irreplaceable. &lt;/p&gt;

&lt;p&gt;Julien Simon from HuggingFace says it best:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG is not a workaround for limited context size. For mission-critical enterprise use cases, RAG is a way to leverage high-value, proprietary company knowledge that will never be found in public datasets used for LLM training. At the moment, the best place to index and query this knowledge is some sort of vector index. In addition, RAG downgrades the LLM to a writing assistant. Since built-in knowledge becomes much less important, a nice small 7B open-source model usually does the trick at a fraction of the cost of a huge generic model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Long Live RAG
&lt;/h2&gt;

&lt;p&gt;As LLMs continue to require enormous computing power, users will need to leverage vector search and RAG.&lt;/p&gt;

&lt;p&gt;Our customers remind us of this fact every day. As a product, our vector database is highly scalable and business-friendly. We develop our features strategically to follow our company’s Unix philosophy. &lt;/p&gt;

&lt;p&gt;We want to keep Qdrant compact, efficient and with a focused purpose. This purpose is to empower our customers to use it however they see fit. &lt;/p&gt;

&lt;p&gt;When large enterprises release their generative AI into production, they need to keep costs under control, while retaining the best possible quality of responses. Qdrant has the tools to do just that. &lt;/p&gt;

&lt;p&gt;Whether through &lt;a href="https://qdrant.tech/articles/vector-similarity-beyond-search/" rel="noopener noreferrer"&gt;RAG, Semantic Search, Dissimilarity Search, Recommendations or Multimodality&lt;/a&gt; - Qdrant will continue to journey on.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Enhance OpenAI Embeddings with Qdrant's Binary Quantization</title>
      <dc:creator>Nirant</dc:creator>
      <pubDate>Fri, 23 Feb 2024 17:46:21 +0000</pubDate>
      <link>https://dev.to/qdrant/enhance-openai-embeddings-with-qdrants-binary-quantization-3bcd</link>
      <guid>https://dev.to/qdrant/enhance-openai-embeddings-with-qdrants-binary-quantization-3bcd</guid>
      <description>&lt;p&gt;OpenAI Ada-003 embeddings are a powerful tool for natural language processing (NLP). However, the size of the embeddings are a challenge, especially with real-time search and retrieval. In this article, we explore how you can use Qdrant's Binary Quantization to enhance the performance and efficiency of OpenAI embeddings.&lt;/p&gt;

&lt;p&gt;In this post, we discuss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The significance of OpenAI embeddings and real-world challenges. &lt;/li&gt;
&lt;li&gt;Qdrant's Binary Quantization, and how it can improve the performance of OpenAI embeddings&lt;/li&gt;
&lt;li&gt;Results of an experiment that highlights improvements in search efficiency and accuracy&lt;/li&gt;
&lt;li&gt;Implications of these findings for real-world applications&lt;/li&gt;
&lt;li&gt;Best practices for leveraging Binary Quantization to enhance OpenAI embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also try out these techniques as described in &lt;a href="https://github.com/qdrant/examples/blob/openai-3/binary-quantization-openai/README.md"&gt;Binary Quantization OpenAI&lt;/a&gt;, which includes Jupyter notebooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  New OpenAI Embeddings: Performance and Changes
&lt;/h2&gt;

&lt;p&gt;As the technology of embedding models has advanced, demand has grown. Users are looking more for powerful and efficient text-embedding models. OpenAI's Ada-003 embeddings offer state-of-the-art performance on a wide range of NLP tasks, including those noted in &lt;a href="https://huggingface.co/spaces/mteb/leaderboard"&gt;MTEB&lt;/a&gt; and &lt;a href="https://openai.com/blog/new-embedding-models-and-api-updates"&gt;MIRACL&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;These models include multilingual support in over 100 languages. The transition from text-embedding-ada-002 to text-embedding-3-large has led to a significant jump in performance scores (from 31.4% to 54.9% on MIRACL).&lt;/p&gt;

&lt;h4&gt;
  
  
  Matryoshka Representation Learning
&lt;/h4&gt;

&lt;p&gt;The new OpenAI models have been trained with a novel approach called "&lt;a href="https://aniketrege.github.io/blog/2024/mrl/"&gt;Matryoshka Representation Learning&lt;/a&gt;". Developers can set up embeddings of different sizes (number of dimensions). In this post, we use the small and large variants. Developers can select embeddings which balance accuracy and size.&lt;/p&gt;

&lt;p&gt;Here, we show how the accuracy of binary quantization holds up quite well across different dimensions, for both models. &lt;/p&gt;

&lt;h2&gt;
  
  
  Enhanced Performance and Efficiency with Binary Quantization
&lt;/h2&gt;

&lt;p&gt;By reducing storage needs, you can scale applications with lower costs. This addresses a critical challenge posed by the original embedding sizes. Binary Quantization also speeds the search process. It simplifies the complex distance calculations between vectors into more manageable bitwise operations, which supports potentially real-time searches across vast datasets. &lt;/p&gt;

&lt;p&gt;The accompanying graph illustrates the promising accuracy levels achievable with binary quantization across different model sizes, showcasing its practicality without severely compromising on performance. This dual advantage of storage reduction and accelerated search capabilities underscores the transformative potential of Binary Quantization in deploying OpenAI embeddings more effectively across various real-world applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ltutmrqrcwxq2x8bc25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ltutmrqrcwxq2x8bc25.png" alt="Image description" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The efficiency gains from Binary Quantization are as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced storage footprint: It helps with large-scale datasets. It also saves on memory, and scales up to 30x at the same cost. &lt;/li&gt;
&lt;li&gt;Enhanced speed of data retrieval: Smaller data sizes generally lead to faster searches. &lt;/li&gt;
&lt;li&gt;Accelerated search process: Binary Quantization simplifies the distance calculations between vectors into bitwise operations. This enables real-time querying even in extensive databases.&lt;/li&gt;
&lt;/ul&gt;
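
&lt;p&gt;For reference, enabling binary quantization when creating a collection looks roughly like this in the Python client (a sketch; the collection name and the &lt;code&gt;always_ram&lt;/code&gt; choice are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")
client.create_collection(
    collection_name="openai_embeddings",
    vectors_config=models.VectorParams(
        size=3072,  # e.g. text-embedding-3-large at full dimensionality
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,  # keep the compact binary index in RAM
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;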

&lt;h3&gt;
  
  
  Experiment Setup: OpenAI Embeddings in Focus
&lt;/h3&gt;

&lt;p&gt;To identify Binary Quantization's impact on search efficiency and accuracy, we designed our experiment on OpenAI text-embedding models. These models, which capture nuanced linguistic features and semantic relationships, are the backbone of our analysis. We then delve deep into the potential enhancements offered by Qdrant's Binary Quantization feature.&lt;/p&gt;

&lt;p&gt;This approach not only leverages the high-caliber OpenAI embeddings but also provides a broad basis for evaluating the search mechanism under scrutiny.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dataset
&lt;/h4&gt;

&lt;p&gt;The research employs 100K random samples from the &lt;a href="https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M"&gt;OpenAI 1M&lt;/a&gt; dataset, focusing on 100 randomly selected records. These records serve as queries in the experiment, aiming to assess how Binary Quantization influences search efficiency and precision within the dataset. We then use the embeddings of the queries to search for the nearest neighbors in the dataset. &lt;/p&gt;

&lt;h4&gt;
  
  
  Parameters: Oversampling, Rescoring, and Search Limits
&lt;/h4&gt;

&lt;p&gt;For each record, we run a parameter sweep over the number of oversampling, rescoring, and search limits. We can then understand the impact of these parameters on search accuracy and efficiency. Our experiment was designed to assess the impact of Binary Quantization under various conditions, based on the following parameters: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Oversampling&lt;/strong&gt;: By oversampling, we can limit the loss of information inherent in quantization. This also helps to preserve the semantic richness of your OpenAI embeddings. We experimented with different oversampling factors, and identified the impact on the accuracy and efficiency of search. Spoiler: higher oversampling factors tend to improve the accuracy of searches. However, they usually require more computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rescoring&lt;/strong&gt;: Rescoring refines the first results of an initial binary search. This process leverages the original high-dimensional vectors to refine the search results, &lt;strong&gt;always&lt;/strong&gt; improving accuracy. We toggled rescoring on and off to measure effectiveness, when combined with Binary Quantization. We also measured the impact on search performance. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search Limits&lt;/strong&gt;: We specify the number of results returned by the search process. We experimented with various search limits to measure their impact on accuracy and efficiency, exploring the trade-offs between search depth and performance. The results provide insight for applications with different precision and speed requirements (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
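
&lt;p&gt;In the Python client, the oversampling, rescoring, and limit parameters map onto the search call roughly as follows; this is a sketch, and &lt;code&gt;query_embedding&lt;/code&gt; is assumed to be an OpenAI embedding produced elsewhere:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")
hits = client.search(
    collection_name="openai_embeddings",
    query_vector=query_embedding,  # assumed: an embedding produced elsewhere
    limit=50,  # search limit: number of results requested
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # use the binary index
            rescore=True,      # rescore candidates with the original vectors
            oversampling=3.0,  # retrieve limit * 3 binary candidates first
        )
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;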

&lt;p&gt;Through this detailed setup, our experiment sought to shed light on the nuanced interplay between Binary Quantization and the high-quality embeddings produced by OpenAI's models. By meticulously adjusting and observing the outcomes under different conditions, we aimed to uncover actionable insights that could empower users to harness the full potential of Qdrant in combination with OpenAI's embeddings, regardless of their specific application needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results: Binary Quantization's Impact on OpenAI Embeddings
&lt;/h3&gt;

&lt;p&gt;To analyze the impact of rescoring (&lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;), we compared results across different model configurations and search limits. Rescoring sets up a more precise search, based on results from an initial query.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rescoring
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov6p6n1csk72ndjqlg9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov6p6n1csk72ndjqlg9a.png" alt="Image description" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some key observations on the impact of rescoring (&lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Significantly Improved Accuracy&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Across all models and dimension configurations, enabling rescoring (&lt;code&gt;True&lt;/code&gt;) consistently results in higher accuracy scores compared to when rescoring is disabled (&lt;code&gt;False&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The improvement in accuracy is true across various search limits (10, 20, 50, 100).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model and Dimension Specific Observations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;code&gt;text-embedding-3-large&lt;/code&gt; model with 3072 dimensions, rescoring boosts the accuracy from an average of about 76-77% without rescoring to 97-99% with rescoring, depending on the search limit and oversampling rate.&lt;/li&gt;
&lt;li&gt;The accuracy improvement with increased oversampling is more pronounced when rescoring is enabled, indicating a better utilization of the additional binary codes in refining search results.&lt;/li&gt;
&lt;li&gt;With the &lt;code&gt;text-embedding-3-small&lt;/code&gt; model at 512 dimensions, accuracy increases from around 53-55% without rescoring to 71-91% with rescoring, highlighting the significant impact of rescoring, especially at lower dimensions.&lt;/li&gt;
&lt;li&gt;For higher dimension models (such as text-embedding-3-large with 3072 dimensions), increasing the oversampling level continues to yield noticeable accuracy gains. In contrast, for lower dimension models (such as text-embedding-3-small with 512 dimensions), the incremental accuracy gains from increased oversampling levels are less significant, even with rescoring enabled. This suggests a diminishing return on accuracy improvement with higher oversampling in lower dimension spaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Influence of Search Limit&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The performance gain from rescoring seems to be relatively stable across different search limits, suggesting that rescoring consistently enhances accuracy regardless of the number of top results considered.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In summary, enabling rescoring dramatically improves search accuracy across all tested configurations. It is a crucial feature for applications where precision is paramount. The consistent performance boost provided by rescoring underscores its value in refining search results, particularly when working with complex, high-dimensional data like OpenAI embeddings. This enhancement is critical for applications that demand high accuracy, such as semantic search, content discovery, and recommendation systems, where the quality of search results directly impacts user experience and satisfaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset Combinations
&lt;/h3&gt;

&lt;p&gt;For those exploring the integration of text embedding models with Qdrant, it's crucial to consider various model configurations for optimal performance. The dataset combinations defined below illustrate different configurations to test against Qdrant. These combinations vary by two primary attributes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Name&lt;/strong&gt;: Signifying the specific text embedding model variant, such as "text-embedding-3-large" or "text-embedding-3-small". This distinction correlates with the model's capacity, with "large" models offering more detailed embeddings at the cost of increased computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimensions&lt;/strong&gt;: This refers to the size of the vector embeddings produced by the model. Options range from 512 to 3072 dimensions. Higher dimensions could lead to more precise embeddings but might also increase the search time and memory usage in Qdrant.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Optimizing these parameters is a balancing act between search accuracy and resource efficiency. Testing across these combinations allows users to identify the configuration that best meets their specific needs, considering the trade-offs between computational resources and the quality of search results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_combinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Exploring Dataset Combinations and Their Impacts on Model Performance
&lt;/h4&gt;

&lt;p&gt;The code snippet iterates through predefined dataset and model combinations. For each combination, characterized by the model name and its dimensions, the corresponding experiment's results are loaded. These results, which are stored in JSON format, include performance metrics like accuracy under different configurations: with and without oversampling, and with and without a rescore step.&lt;/p&gt;

&lt;p&gt;Following the extraction of these metrics, the code computes the average accuracy across different settings, excluding extreme cases of very low limits (specifically, limits of 1 and 5). This computation groups the results by oversampling, rescore presence, and limit, before calculating the mean accuracy for each subgroup.&lt;/p&gt;

&lt;p&gt;After gathering and processing this data, the average accuracies are organized into a pivot table. This table is indexed by the limit (the number of top results considered), and columns are formed based on combinations of oversampling and rescoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;combination&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset_combinations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combination&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;dimensions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combination&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, dimensions: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../results/results-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oversampling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rescore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oversampling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rescore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Impact of Oversampling
&lt;/h4&gt;

&lt;p&gt;In quantized search, oversampling controls how many candidates are retrieved with the compressed binary vectors before the final ranking. With an oversampling factor of 3 and a limit of 100, Qdrant fetches 300 candidates using the fast binary index. Combined with rescoring, which re-ranks those candidates against the original full-precision vectors, the larger candidate pool makes it far more likely that the true top results survive the lossy first pass, recovering most of the accuracy lost to quantization.&lt;/p&gt;
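
&lt;p&gt;Both knobs are applied at query time. Here is a minimal sketch of such a query, assuming a recent &lt;code&gt;qdrant-client&lt;/code&gt; and an illustrative collection name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

query_vector = [0.0] * 3072  # placeholder: a real OpenAI embedding goes here

# Fetch limit * oversampling candidates with the binary index,
# then re-rank that pool against the original vectors (rescore=True).
response = client.query_points(
    collection_name="openai-embeddings",  # hypothetical collection name
    query=query_vector,
    limit=100,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,     # use the quantized index for the first pass
            rescore=True,
            oversampling=3.0,
        )
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The effective candidate pool here is 300 points, from which the best 100 are returned after rescoring.&lt;/p&gt;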

&lt;p&gt;The chart below shows how average accuracy responds to the oversampling factor. Increasing the factor widens the candidate pool available for rescoring, so accuracy improves, with diminishing returns as the factor grows. That is the trade-off to tune: every increment adds rescoring work in exchange for recovering more of the accuracy lost to binarization.&lt;/p&gt;

&lt;p&gt;Reading the results as a before-and-after comparison, with rescoring switched on and off at each oversampling level, makes the contribution of each parameter visible on its own and shows why the two are tuned together rather than in isolation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsircyes2fei0l60gp91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsircyes2fei0l60gp91.png" alt="Image description" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Leveraging Binary Quantization: Best Practices
&lt;/h3&gt;

&lt;p&gt;We recommend the following best practices for leveraging Binary Quantization to enhance OpenAI embeddings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding Model: Use OpenAI's text-embedding-3-large. It was the most accurate among the models we tested.&lt;/li&gt;
&lt;li&gt;Dimensions: Use the highest number of dimensions available for the model to maximize accuracy. This held for English as well as for other languages in our tests.&lt;/li&gt;
&lt;li&gt;Oversampling: Use an oversampling factor of 3 for the best balance between accuracy and efficiency. This factor is suitable for a wide range of applications.&lt;/li&gt;
&lt;li&gt;Rescoring: Enable rescoring to improve the accuracy of search results.&lt;/li&gt;
&lt;li&gt;RAM: Store the full vectors and payload on disk and keep only the binary quantization index in RAM. This reduces the memory footprint without hurting end-to-end latency: the extra disk read during rescoring is negligible compared to the savings from binary scoring, which Qdrant accelerates with SIMD instructions where possible. See the configuration sketch after this list.&lt;/li&gt;
&lt;/ol&gt;
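
&lt;p&gt;Items 1, 2, and 5 come together at collection-creation time. The following sketch assumes the same hypothetical collection name as above and the 3072-dimensional vectors of text-embedding-3-large:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

client.create_collection(
    collection_name="openai-embeddings",  # hypothetical collection name
    vectors_config=models.VectorParams(
        size=3072,                        # text-embedding-3-large dimensionality
        distance=models.Distance.COSINE,
        on_disk=True,                     # full-precision vectors live on disk
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,              # only the compact binary index stays in RAM
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this layout, the binary index handles the fast first pass in memory, while the on-disk originals are touched only during rescoring.&lt;/p&gt;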

&lt;p&gt;Want to discuss these findings and learn more about Binary Quantization? &lt;a href="https://discord.gg/qdrant"&gt;Join our Discord community.&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Learn more about how to boost your vector search speed and accuracy while reducing costs: &lt;a href="https://qdrant.tech/documentation/guides/quantization/?selector=aHRtbCA%2BIGJvZHkgPiBkaXY6bnRoLW9mLXR5cGUoMSkgPiBzZWN0aW9uID4gZGl2ID4gZGl2ID4gZGl2Om50aC1vZi10eXBlKDIpID4gYXJ0aWNsZSA%2BIGgyOm50aC1vZi10eXBlKDIp"&gt;Binary Quantization.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
