<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Qdrant</title>
    <description>The latest articles on DEV Community by Qdrant (@qdrant).</description>
    <link>https://dev.to/qdrant</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5330%2F84d1e97d-d01b-4ede-9c69-06112433b11b.png</url>
      <title>DEV Community: Qdrant</title>
      <link>https://dev.to/qdrant</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/qdrant"/>
    <language>en</language>
    <item>
      <title>Relevance Feedback in Information Retrieval</title>
      <dc:creator>Evgeniya Sukhodolskaya</dc:creator>
      <pubDate>Mon, 31 Mar 2025 21:08:23 +0000</pubDate>
      <link>https://dev.to/qdrant/relevance-feedback-in-informational-retrieval-54np</link>
      <guid>https://dev.to/qdrant/relevance-feedback-in-informational-retrieval-54np</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A problem well stated is a problem half solved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This quote applies as much to life as it does to information retrieval.&lt;/p&gt;

&lt;p&gt;With a well-formulated query, retrieving the relevant document becomes trivial. In reality, however, most users struggle to precisely define what they are searching for.&lt;/p&gt;

&lt;p&gt;While users may struggle to formulate a perfect request — especially in unfamiliar topics — they can easily judge whether a retrieved answer is relevant or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relevance is a powerful feedback mechanism for a retrieval system&lt;/strong&gt; to iteratively refine results in the direction of user interest.&lt;/p&gt;

&lt;p&gt;In 2025, with social media flooded with daily AI breakthroughs, it almost seems like information retrieval is solved: agents can iteratively adjust their search queries while assessing the relevance of the results.&lt;/p&gt;

&lt;p&gt;Of course, there’s a catch: these models still rely on retrieval systems (RAG isn’t dead yet, despite daily predictions of its demise). They receive only a handful of top-ranked results provided by a far simpler and cheaper retriever. As a result, the success of guided retrieval still mainly depends on the retrieval system itself.&lt;/p&gt;

&lt;p&gt;So, we should find a way of effectively and efficiently incorporating relevance feedback directly into a retrieval system. In this article, we’ll explore the approaches proposed in the research literature and try to answer the following question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If relevance feedback in search is so widely studied and praised as effective, why is it practically not used in dedicated vector search solutions?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dismantling the Relevance Feedback
&lt;/h2&gt;

&lt;p&gt;Both industry and academia tend to reinvent the wheel here and there. So, we first took some time to study and categorize different methods — just in case there was something we could plug directly into Qdrant. The resulting taxonomy isn’t set in stone, but we aim to make it useful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp53vdsz29r21gh1n5ine.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp53vdsz29r21gh1n5ine.png" alt="Types of Relevance Feedback" width="800" height="476"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Types of Relevance Feedback&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-Relevance Feedback (PRF)
&lt;/h2&gt;

&lt;p&gt;Pseudo-relevance feedback takes the top-ranked documents from the initial retrieval results and treats them as relevant. This approach might seem naive, but it provides a noticeable performance boost in lexical retrieval while being relatively cheap to compute.&lt;/p&gt;
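
&lt;p&gt;In code, the whole trick is one step: run the initial search and promote the top-k hits to a relevant set. A minimal Rust sketch (the &lt;code&gt;ScoredDoc&lt;/code&gt; type is our illustrative assumption, not a specific library API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;struct ScoredDoc {
    id: u64,
    score: f32,
}

/// Treat the k best-scored hits of the initial retrieval as relevant.
/// `hits` is assumed to be sorted by descending similarity score.
fn pseudo_relevant(hits: &amp;[ScoredDoc], k: usize) -&gt; &amp;[ScoredDoc] {
    &amp;hits[..k.min(hits.len())]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;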

&lt;h2&gt;
  
  
  Binary Relevance Feedback
&lt;/h2&gt;

&lt;p&gt;The most straightforward way to gather feedback is to ask users directly whether a document is relevant. There are two main limitations to this approach:&lt;/p&gt;

&lt;p&gt;First, users are notoriously reluctant to provide feedback. Did you know that &lt;a href="https://en.wikipedia.org/wiki/Google_SearchWiki#:~:text=SearchWiki%20was%20a%20Google%20Search,for%20a%20given%20search%20query" rel="noopener noreferrer"&gt;Google once had&lt;/a&gt; an upvote/downvote mechanism on search results but removed it because almost no one used it?&lt;/p&gt;

&lt;p&gt;Second, even if users are willing to provide feedback, no relevant documents might be present in the initial retrieval results. In this case, the user can’t provide a meaningful signal.&lt;/p&gt;

&lt;p&gt;Instead of asking users, we can ask a smart model to provide binary relevance judgements, but forcing its output to be binary would limit its potential to generate more granular judgements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Re-scored Relevance Feedback
&lt;/h2&gt;

&lt;p&gt;We can also apply more sophisticated methods to extract relevance feedback from the top-ranked documents: machine learning models can provide a relevance score for each document.&lt;/p&gt;

&lt;p&gt;The obvious concern here is twofold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How accurately can the automated judge determine relevance (or irrelevance)?&lt;/li&gt;
&lt;li&gt;How cost-efficient is it? After all, you can’t expect GPT-4o to re-rank thousands of documents for every user query — unless you’re filthy rich.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nevertheless, automated re-scored feedback could be a scalable way to improve search when explicit binary feedback is not accessible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Has the Problem Already Been Solved?
&lt;/h2&gt;

&lt;p&gt;Digging through research materials, the last thing we expected was to discover that the first relevance feedback study dates back &lt;em&gt;&lt;a href="https://sigir.org/files/museum/pub-08/XXIII-1.pdf" rel="noopener noreferrer"&gt;sixty years&lt;/a&gt;&lt;/em&gt;. In the midst of the neural search bubble, it’s easy to forget that lexical (term-based) retrieval has been around for decades. Naturally, research in that field has had enough time to develop.&lt;/p&gt;

&lt;p&gt;Neural search — aka &lt;a href="https://qdrant.tech/articles/neural-search-tutorial/" rel="noopener noreferrer"&gt;vector search&lt;/a&gt; — gained traction in the industry around 5 years ago. Hence, vector-specific relevance feedback techniques might still be in their early stages, awaiting production-grade validation and industry adoption.&lt;/p&gt;

&lt;p&gt;As a &lt;a href="https://qdrant.tech/articles/dedicated-vector-search/" rel="noopener noreferrer"&gt;dedicated vector search engine&lt;/a&gt;, we would like to be among those adopters. Our focus is neural search, but approaches in both lexical and neural retrieval seem worth exploring, as cross-field studies are always insightful, with the potential to reuse well-established methods of one field in another.&lt;/p&gt;

&lt;p&gt;We found some interesting methods applicable to neural search solutions and additionally revealed a gap in the neural search-based relevance feedback approaches. Stick around, and we’ll share our findings!&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Ways to Approach the Problem
&lt;/h2&gt;

&lt;p&gt;Retrieval as a recipe can be broken down into three main ingredients:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query&lt;/li&gt;
&lt;li&gt;Documents&lt;/li&gt;
&lt;li&gt;Similarity scoring between them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5zlfoih3dfy703t6x7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5zlfoih3dfy703t6x7b.png" alt="Research Field Taxonomy Overview" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Research Field Taxonomy Overview&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Query formulation is a subjective process – it can be done in infinitely many ways, making the relevance of a document unpredictable until the query is formulated and submitted to the system.&lt;/p&gt;

&lt;p&gt;So, adapting documents (or the search index) to relevance feedback would require per-request dynamic changes, which is impractical, considering that modern retrieval systems store billions of documents.&lt;/p&gt;

&lt;p&gt;Thus, approaches for incorporating relevance feedback in search fall into two categories: refining a query and refining the similarity scoring function between the query and documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Refinement
&lt;/h2&gt;

&lt;p&gt;There are several ways to refine a query based on relevance feedback. Globally, we prefer to distinguish between two approaches: modifying the query as text and modifying the vector representation of the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpg1hmg8871vo5z563vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpg1hmg8871vo5z563vy.png" alt="Incorporating Relevance Feedback in Query" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Incorporating Relevance Feedback in Query&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query As Text
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;term-based retrieval&lt;/strong&gt;, an intuitive way to improve a query would be to expand it with relevant terms. It resembles the “aha, so that’s what it’s called” stage of discovery search.&lt;/p&gt;

&lt;p&gt;Before the deep learning era of this century, expansion terms were mainly selected using statistical or probabilistic models. The idea was to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Either extract the &lt;strong&gt;most frequent&lt;/strong&gt; terms from (pseudo-)relevant documents;&lt;/li&gt;
&lt;li&gt;Or the &lt;strong&gt;most specific&lt;/strong&gt; ones (for example, according to IDF);&lt;/li&gt;
&lt;li&gt;Or the &lt;strong&gt;most probable&lt;/strong&gt; ones (most likely to appear in the query according to a relevance set).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Well-known methods of those times come from the family of &lt;a href="https://sigir.org/wp-content/uploads/2017/06/p260.pdf" rel="noopener noreferrer"&gt;Relevance Models&lt;/a&gt;, where terms for expansion are chosen based on their probability in pseudo-relevant documents (how often the terms appear) and on the likelihood of the query given those pseudo-relevant documents (how strongly the pseudo-relevant documents match the query).&lt;/p&gt;

&lt;p&gt;The most famous one, &lt;code&gt;RM3&lt;/code&gt; – an interpolation of expansion term probabilities with their probabilities in the query – still appears in papers from the last few years as a (noticeably decent) baseline in term-based retrieval, usually as part of &lt;a href="https://github.com/castorini/anserini" rel="noopener noreferrer"&gt;Anserini&lt;/a&gt;.&lt;/p&gt;
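
&lt;p&gt;As a rough Rust sketch of the RM3 idea (our compression of it, not the Anserini implementation; per-document term distributions and query likelihoods are assumed to be given):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;

/// RM3-style interpolation: lambda * P(t|query) + (1 - lambda) * P(t|R).
/// P(t|R) is each term's probability mass in the pseudo-relevant documents,
/// with every document weighted by how strongly it matched the query.
/// Assumes a non-empty feedback set.
fn rm3_weights(
    query_model: &amp;HashMap&lt;String, f32&gt;,        // P(t|query)
    feedback: &amp;[(HashMap&lt;String, f32&gt;, f32)],  // (P(t|doc), query likelihood)
    lambda: f32,
) -&gt; HashMap&lt;String, f32&gt; {
    let norm: f32 = feedback.iter().map(|(_, w)| w).sum();
    let mut relevance_model: HashMap&lt;String, f32&gt; = HashMap::new();
    for (term_probs, doc_weight) in feedback {
        for (term, p) in term_probs {
            *relevance_model.entry(term.clone()).or_insert(0.0) += p * doc_weight / norm;
        }
    }
    let mut expanded: HashMap&lt;String, f32&gt; = HashMap::new();
    for (term, p_rel) in &amp;relevance_model {
        let p_q = query_model.get(term).copied().unwrap_or(0.0);
        expanded.insert(term.clone(), lambda * p_q + (1.0 - lambda) * p_rel);
    }
    // Query terms absent from the relevance model keep their interpolated mass.
    for (term, p_q) in query_model {
        expanded.entry(term.clone()).or_insert(lambda * p_q);
    }
    expanded
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;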

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb6370tvbusklbmfm1bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb6370tvbusklbmfm1bg.png" alt="Simplified Query Expansion" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Simplified Query Expansion&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As retrieval approached the modern machine learning era, &lt;a href="https://aclanthology.org/2020.findings-emnlp.424.pdf" rel="noopener noreferrer"&gt;multiple studies&lt;/a&gt; began claiming that these traditional ways of query expansion are not as effective as they could be.&lt;/p&gt;

&lt;p&gt;What started with simple classifiers based on hand-crafted features naturally led to the use of the famous &lt;a href="https://huggingface.co/docs/transformers/model_doc/bert" rel="noopener noreferrer"&gt;BERT (Bidirectional Encoder Representations from Transformers)&lt;/a&gt;. For example, the BERT-QE (Query Expansion) authors came up with the following schema, sketched in code after the list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get pseudo-relevance feedback from the finetuned BERT reranker (~10 documents);&lt;/li&gt;
&lt;li&gt;Chunk these pseudo-relevant documents (~100 words) and score query-chunk relevance with the same reranker;&lt;/li&gt;
&lt;li&gt;Expand the query with the most relevant chunks;&lt;/li&gt;
&lt;li&gt;Rerank 1000 documents with the reranker using the expanded query.&lt;/li&gt;
&lt;/ol&gt;
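
&lt;p&gt;In compressed pseudo-Rust, the schema looks roughly like this (the &lt;code&gt;Reranker&lt;/code&gt; trait stands in for the finetuned BERT model; chunk and expansion counts are illustrative, and a real implementation would cache scores instead of recomputing them inside the sorts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Stand-in for a finetuned BERT reranker scoring (query, text) pairs.
trait Reranker {
    fn score(&amp;self, query: &amp;str, text: &amp;str) -&gt; f32;
}

/// Split a document into fixed-size word windows (~100 words in the paper).
fn chunk(doc: &amp;str, words: usize) -&gt; Vec&lt;String&gt; {
    let tokens: Vec&lt;&amp;str&gt; = doc.split_whitespace().collect();
    tokens.chunks(words).map(|w| w.join(" ")).collect()
}

fn bert_qe(query: &amp;str, docs: &amp;mut Vec&lt;String&gt;, model: &amp;dyn Reranker) -&gt; Vec&lt;String&gt; {
    // 1. Pseudo-relevance feedback: rerank; the ~10 best documents are "relevant".
    docs.sort_by(|a, b| model.score(query, b).total_cmp(&amp;model.score(query, a)));

    // 2. Chunk the feedback documents and score each chunk against the query.
    let mut chunks: Vec&lt;(f32, String)&gt; = docs.iter().take(10)
        .flat_map(|d| chunk(d, 100))
        .map(|c| (model.score(query, &amp;c), c))
        .collect();
    chunks.sort_by(|a, b| b.0.total_cmp(&amp;a.0));

    // 3. Expand the query with the most relevant chunks (count is a hyperparameter).
    let top: Vec&lt;String&gt; = chunks.into_iter().take(3).map(|(_, c)| c).collect();
    let expanded = format!("{} {}", query, top.join(" "));

    // 4. Rerank the (up to 1000) candidates with the expanded query.
    docs.sort_by(|a, b| model.score(&amp;expanded, b).total_cmp(&amp;model.score(&amp;expanded, a)));
    docs.clone()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;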

&lt;p&gt;This approach significantly outperformed BM25 + RM3 baseline in experiments (+11% NDCG@20). However, it required &lt;strong&gt;11.01x&lt;/strong&gt; more computation than just using BERT for reranking, and reranking 1000 documents with BERT would take around 9 seconds alone.&lt;/p&gt;

&lt;p&gt;Query term expansion can hypothetically work for neural retrieval as well. New terms might shift the query vector closer to that of the desired document. However, &lt;a href="https://dl.acm.org/doi/10.1145/3570724" rel="noopener noreferrer"&gt;this approach isn’t guaranteed to succeed.&lt;/a&gt; Neural search depends entirely on embeddings, and how those embeddings are generated — consequently, how similar query and document vectors are — depends heavily on the model’s training.&lt;/p&gt;

&lt;p&gt;It definitely works if &lt;strong&gt;query refining is done by a model operating in the same vector space&lt;/strong&gt;, which typically requires offline training of a retriever. The goal is to extend the query encoder input to also include feedback documents, producing an adjusted query embedding. Examples include &lt;code&gt;ANCE-PRF&lt;/code&gt; and &lt;code&gt;ColBERT-PRF&lt;/code&gt; – fine-tuned extensions of ANCE and ColBERT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjmlpwewqttcsy87kyfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjmlpwewqttcsy87kyfy.png" alt="Generating a new relevance-aware query vector" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Generating a new relevance-aware query vector&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason you’re most probably not familiar with these models – their absence from industry – is that their &lt;strong&gt;training&lt;/strong&gt; itself is a &lt;strong&gt;high upfront cost&lt;/strong&gt;, and even after that cost has been “paid”, these models &lt;a href="https://arxiv.org/abs/2108.13454" rel="noopener noreferrer"&gt;struggle with generalization&lt;/a&gt;, performing poorly on out-of-domain tasks (datasets they haven’t seen during training). Additionally, feeding an attention-based model a lengthy input (query + documents) is bad practice in production settings (attention is quadratic in the input length), where time and money are crucial decision factors.&lt;/p&gt;

&lt;p&gt;Alternatively, one could skip a step — and work directly with vectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query As Vector
&lt;/h2&gt;

&lt;p&gt;Instead of modifying the initial query, a more scalable approach is to directly adjust the query vector. It is easily applicable across modalities and suitable for both lexical and neural retrieval.&lt;/p&gt;

&lt;p&gt;Although vector search has become a trend in recent years, its core principles have existed in the field for decades. For example, the SMART retrieval system used by &lt;a href="https://sigir.org/files/museum/pub-08/XXIII-1.pdf" rel="noopener noreferrer"&gt;Rocchio&lt;/a&gt; in 1965 for his relevance feedback experiments operated on bag-of-words vector representations of text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftclzn0909awrahye20sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftclzn0909awrahye20sx.png" alt="Rocchio’s Relevance Feedback Method" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Rocchio’s Relevance Feedback Method&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rocchio’s idea&lt;/strong&gt; — to update the query vector by adding a difference between the centroids of relevant and non-relevant documents — seems to translate well to modern dual encoders-based dense retrieval systems. Researchers seem to agree: a study from 2022 demonstrated that the &lt;a href="https://arxiv.org/pdf/2108.11044" rel="noopener noreferrer"&gt;parametrized version of Rocchio&lt;/a&gt;’s method in dense retrieval consistently improves Recall@1000 by 1–5%, while keeping query processing time suitable for production — around 170 ms.&lt;/p&gt;

&lt;p&gt;However, the parameters (centroid and query weights) in the dense retrieval version of Rocchio’s method must be tuned for each dataset and, ideally, also for each request.&lt;/p&gt;
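
&lt;p&gt;For intuition, the update itself is tiny; it is the weights that are hard to pick. A Rust sketch, where &lt;code&gt;alpha&lt;/code&gt;, &lt;code&gt;beta&lt;/code&gt;, and &lt;code&gt;gamma&lt;/code&gt; are exactly the parameters that need tuning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Rocchio update: move the query toward the centroid of relevant documents
/// and away from the centroid of non-relevant ones. Assumes non-empty sets.
fn rocchio(
    query: &amp;[f32],
    relevant: &amp;[Vec&lt;f32&gt;],
    non_relevant: &amp;[Vec&lt;f32&gt;],
    alpha: f32, beta: f32, gamma: f32,
) -&gt; Vec&lt;f32&gt; {
    let dim = query.len();
    let centroid = |docs: &amp;[Vec&lt;f32&gt;]| -&gt; Vec&lt;f32&gt; {
        let mut c = vec![0.0; dim];
        for d in docs {
            for (ci, di) in c.iter_mut().zip(d) {
                *ci += di / docs.len() as f32;
            }
        }
        c
    };
    let (rel, non) = (centroid(relevant), centroid(non_relevant));
    (0..dim).map(|i| alpha * query[i] + beta * rel[i] - gamma * non[i]).collect()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;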

&lt;h2&gt;
  
  
  Gradient Descent-Based Methods
&lt;/h2&gt;

&lt;p&gt;An efficient way of doing so on the fly remained an open question until the introduction of a gradient descent-based generalization of Rocchio’s method: &lt;code&gt;Test-Time Optimization of Query Representations (TOUR)&lt;/code&gt;. TOUR adapts a query vector over multiple iterations of retrieval and reranking (retrieve → rerank → gradient descent step), guided by a reranker’s relevance judgments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uahnekmzusie7i2euek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uahnekmzusie7i2euek.png" alt="An overview of TOUR iteratively optimizing initial query representation based on pseudo relevance feedback." width="546" height="537"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure adapted from Sung et al., 2023,&lt;/em&gt; &lt;em&gt;&lt;a href="https://arxiv.org/pdf/2205.12680" rel="noopener noreferrer"&gt;Optimizing Test-Time Query Representations for Dense Retrieval&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next iteration of gradient-based query refinement methods, &lt;code&gt;ReFit&lt;/code&gt;, was proposed in 2024 as a lighter, production-friendly alternative to TOUR, limiting the retrieve → rerank → gradient descent sequence to a single iteration. The retriever’s query vector is updated by matching (via &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence" rel="noopener noreferrer"&gt;Kullback–Leibler divergence&lt;/a&gt;) the retriever’s and the cross-encoder’s similarity score distributions over feedback documents. ReFit is model- and language-independent and stably improves the Recall@100 metric by 2–3%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpr22shawr4r1y0t0mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpr22shawr4r1y0t0mn.png" alt="An overview of ReFit, a gradient-based method for query refinement" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;
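
&lt;p&gt;The core of the update is compact. Assuming dot-product scoring, the gradient of the KL divergence with respect to the query vector has a closed form, so a single ReFit-style step can be sketched as follows (our illustration of the idea, not the authors’ code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn softmax(scores: &amp;[f32]) -&gt; Vec&lt;f32&gt; {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec&lt;f32&gt; = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// One gradient step pulling the retriever's score distribution (softmax over
/// dot products) toward the cross-encoder's distribution on feedback docs:
/// d KL(r || p) / d query = sum_i (p_i - r_i) * doc_i.
fn refit_step(query: &amp;mut [f32], docs: &amp;[Vec&lt;f32&gt;], ce_scores: &amp;[f32], lr: f32) {
    let scores: Vec&lt;f32&gt; = docs.iter()
        .map(|d| query.iter().zip(d).map(|(q, x)| q * x).sum())
        .collect();
    let p = softmax(&amp;scores);   // retriever distribution
    let r = softmax(ce_scores); // cross-encoder (teacher) distribution
    for (i, d) in docs.iter().enumerate() {
        for (qj, dj) in query.iter_mut().zip(d) {
            *qj -= lr * (p[i] - r[i]) * dj;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;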

&lt;p&gt;Gradient descent-based methods seem like a production-viable option, an alternative to finetuning the retriever (distilling it from a reranker). Indeed, they don’t require training in advance and are compatible with any reranking model.&lt;/p&gt;

&lt;p&gt;However, a few limitations baked into these methods prevented a broader adoption in the industry.&lt;/p&gt;

&lt;p&gt;The gradient descent-based methods modify elements of the query vector as if they were model parameters; therefore, they require a substantial number of feedback documents to converge to a stable solution.&lt;/p&gt;

&lt;p&gt;On top of that, the gradient descent-based methods are sensitive to the choice of hyperparameters, which can lead to query drift, where the refined query moves entirely away from the user’s intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Similarity Scoring
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rlhnc1gci2yjir47irb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rlhnc1gci2yjir47irb.png" alt="Incorporating Relevance Feedback in Similarity Scoring" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another family of approaches is built around the idea of incorporating relevance feedback directly into the similarity scoring function. It might be desirable in cases where we want to preserve the original query intent, but still adjust the similarity score based on relevance feedback.&lt;/p&gt;

&lt;p&gt;In lexical retrieval, this can be as simple as boosting documents that share more terms with those judged as relevant.&lt;/p&gt;

&lt;p&gt;Its neural search counterpart is a &lt;code&gt;k-nearest neighbors-based method&lt;/code&gt; that adjusts the query-document similarity score by adding the sum of similarities between the candidate document and all known relevant examples. This technique yields a significant improvement, around 5.6 percentage points in NDCG@20, but it requires a substantial number of explicitly labelled feedback documents to be effective.&lt;/p&gt;
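
&lt;p&gt;In sketch form, with dot-product similarity and a weighting factor &lt;code&gt;lambda&lt;/code&gt; as our assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Adjusted score = original query-document similarity plus the summed
/// similarity of the candidate to every known relevant example.
fn knn_adjusted_score(
    query_doc_sim: f32,
    doc: &amp;[f32],
    relevant: &amp;[Vec&lt;f32&gt;],
    lambda: f32,
) -&gt; f32 {
    let dot = |a: &amp;[f32], b: &amp;[f32]| -&gt; f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    };
    query_doc_sim + lambda * relevant.iter().map(|r| dot(doc, r)).sum::&lt;f32&gt;()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;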

&lt;p&gt;In experiments, the knn-based method is treated as a reranker. In all other papers, we also found that adjusting similarity scores based on relevance feedback is centred around &lt;a href="https://qdrant.tech/documentation/search-precision/reranking-semantic-search/" rel="noopener noreferrer"&gt;reranking&lt;/a&gt; – &lt;strong&gt;training or finetuning rerankers to become relevance feedback&lt;/strong&gt;-aware. Typically, experiments include cross-encoders, though &lt;a href="https://arxiv.org/pdf/1904.08861" rel="noopener noreferrer"&gt;simple classifiers are also an option&lt;/a&gt;. These methods generally involve rescoring a broader set of documents retrieved during an initial search, guided by feedback from a smaller top-ranked subset. It is not a similarity matching function adjustment per se but rather a similarity scoring model adjustment.&lt;/p&gt;

&lt;p&gt;Methods typically fall into two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training rerankers offline&lt;/strong&gt; to ingest relevance feedback as an additional input at inference time, &lt;a href="https://aclanthology.org/D18-1478.pdf" rel="noopener noreferrer"&gt;as here&lt;/a&gt; — again, attention-based models and lengthy inputs: a production-deadly combination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Finetuning rerankers&lt;/strong&gt; on relevance feedback from the first retrieval stage, &lt;a href="https://aclanthology.org/2022.emnlp-main.614.pdf" rel="noopener noreferrer"&gt;as Baumgärtner et al. did&lt;/a&gt;, finetuning bias parameters of a small cross-encoder per query on 2,000 feedback documents (still a lot, especially considering that their method works with explicit feedback).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The biggest limitation here is that these reranker-based methods cannot retrieve relevant documents beyond those returned in the initial search, and using rerankers on thousands of documents in production is a no-go – it’s too expensive. Ideally, to avoid that, a similarity scoring function updated with relevance feedback should be used directly in the second retrieval iteration. However, in every research paper we’ve come across, retrieval systems are &lt;strong&gt;treated as black boxes&lt;/strong&gt; — ingesting queries, returning results, and offering no built-in mechanism to modify scoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what are the takeaways?
&lt;/h2&gt;

&lt;p&gt;Pseudo-Relevance Feedback (PRF) is known to improve the effectiveness of lexical retrievers. Several PRF-based approaches – mainly based on query term expansion – have been successfully integrated into traditional retrieval systems. At the same time, there are &lt;strong&gt;no known industry-adopted analogues in dedicated neural (vector) search solutions&lt;/strong&gt;; neural search-compatible methods remain stuck in research papers.&lt;/p&gt;

&lt;p&gt;The gap we noticed while studying the field is that researchers have &lt;strong&gt;no direct access to retrieval systems&lt;/strong&gt;, forcing them to design wrappers around the black-box-like retrieval oracles. This is sufficient for query-adjusting methods but not for similarity scoring function adjustment.&lt;/p&gt;

&lt;p&gt;Perhaps relevance feedback methods haven’t made it into the neural search systems for trivial reasons — like no one having the time to find the right balance between cost and efficiency.&lt;/p&gt;

&lt;p&gt;Getting it to work in a production setting means experimenting, building interfaces, and adapting architectures. Simply put, it needs to look worth it. And unlike 2D vector math, high-dimensional vector spaces are anything but intuitive. The curse of dimensionality is real. So is query drift. Even methods that make perfect sense on paper might not work in practice.&lt;/p&gt;

&lt;p&gt;A real-world solution should be simple. Maybe just a little bit smarter than a rule-based approach, but still practical. It shouldn’t require fine-tuning thousands of parameters or feeding paragraphs of text into transformers. &lt;strong&gt;And for it to be effective, it needs to be integrated directly into the retrieval system itself.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>rag</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introducing Gridstore: Qdrant's Custom Key-Value Store</title>
      <dc:creator>Luis Cossío</dc:creator>
      <pubDate>Thu, 06 Feb 2025 12:07:39 +0000</pubDate>
      <link>https://dev.to/qdrant/introducing-gridstore-qdrants-custom-key-value-store-2li9</link>
      <guid>https://dev.to/qdrant/introducing-gridstore-qdrants-custom-key-value-store-2li9</guid>
      <description>&lt;h2&gt;
  
  
  Why We Built Our Own Storage Engine
&lt;/h2&gt;

&lt;p&gt;Databases need a place to store and retrieve data. That’s what Qdrant's &lt;a href="https://en.wikipedia.org/wiki/Key%E2%80%93value_database" rel="noopener noreferrer"&gt;&lt;strong&gt;key-value storage&lt;/strong&gt;&lt;/a&gt; does—it links keys to values.&lt;/p&gt;

&lt;p&gt;When we started building Qdrant, we needed to pick something ready for the task. So we chose &lt;a href="https://rocksdb.org" rel="noopener noreferrer"&gt;&lt;strong&gt;RocksDB&lt;/strong&gt;&lt;/a&gt; as our embedded key-value store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwriweh16glaxmgrwt8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwriweh16glaxmgrwt8m.png" alt="Image description" width="730" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over time, we ran into issues. Its architecture requires compaction (it uses an &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSMT&lt;/a&gt;), which caused random latency spikes. It handles generic keys, while we only use it for sequential IDs. Having lots of configuration options makes it versatile, but tuning it accurately was a headache. Finally, interoperating with C++ slowed us down (although we will still support it for quite some time 😭).&lt;/p&gt;

&lt;p&gt;While there are already some good options written in Rust that we could leverage, we needed something custom. Nothing out there fit our needs in the way we wanted. We didn’t require generic keys. We wanted full control over when and which data was written and flushed. Our system already has crash recovery mechanisms built in. Online compaction isn’t a priority; we already have optimizers for that. Debugging misconfigurations was not a great use of our time.&lt;/p&gt;

&lt;p&gt;So we built our own storage. As of &lt;a href="https://qdrant.tech/blog/qdrant-1.13.x/"&gt;&lt;strong&gt;Qdrant Version 1.13&lt;/strong&gt;&lt;/a&gt;, we are using Gridstore for &lt;strong&gt;payload and sparse vector storages&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg5t7nghedw9zqxlv03l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg5t7nghedw9zqxlv03l.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  In this article, you’ll learn about:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How Gridstore works&lt;/strong&gt; – a deep dive into its architecture and mechanics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why we built it this way&lt;/strong&gt; – the key design decisions that shaped it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rigorous testing&lt;/strong&gt; – how we ensured the new storage is production-ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance benchmarks&lt;/strong&gt; – official metrics that demonstrate its efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our first challenge?&lt;/strong&gt; Figuring out the best way to handle sequential keys and variable-sized data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Gridstore Architecture: Three Main Components
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng6hr3zi52diiqwdx4sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng6hr3zi52diiqwdx4sj.png" alt="Image description" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores values in fixed-sized blocks and retrieves them using a pointer-based lookup system for efficient access.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mask Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maintains a bitmask to track block usage, distinguishing between allocated and available blocks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gaps Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manages block availability at a higher level, optimizing space allocation and reuse.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  1. The Data Layer for Fast Retrieval
&lt;/h3&gt;

&lt;p&gt;At the core of Gridstore is &lt;strong&gt;The Data Layer&lt;/strong&gt;, which is designed to store and retrieve values quickly based on their keys. This layer allows us to do efficient reads and lets us store variable-sized data. The two main components of this layer are &lt;strong&gt;The Tracker&lt;/strong&gt; and &lt;strong&gt;The Data Grid&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since internal IDs are always sequential integers (0, 1, 2, 3, 4, ...), the tracker is an array of pointers, where each pointer tells the system exactly where a value starts and how long it is. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko72b7r8luo39rj2apfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko72b7r8luo39rj2apfo.png" alt="Image description" width="800" height="271"&gt;&lt;/a&gt;&lt;br&gt;
The Data Layer uses an array of pointers to quickly retrieve data.&lt;/p&gt;

&lt;p&gt;This makes lookups incredibly fast. For example, finding key 3 is just a matter of jumping to position 3 in the tracker and following the pointer to the value in the data grid. &lt;/p&gt;

&lt;p&gt;However, because values are of variable size, the data itself is stored separately in a grid of fixed-sized blocks, which are grouped into larger page files. The fixed size of each block is usually 128 bytes. When inserting a value, Gridstore allocates one or more consecutive blocks to store it, ensuring that each block only holds data from a single value.&lt;/p&gt;
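
&lt;p&gt;In illustrative Rust (the names are ours, not Gridstore’s actual types):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;const BLOCK_SIZE: usize = 128;

/// One tracker slot: where the value starts and how long it is.
struct ValuePointer {
    page_id: u32,
    block_offset: u32,
    length: u32,
}

/// Lookup is just an array index, since internal IDs are sequential.
fn lookup(tracker: &amp;[Option&lt;ValuePointer&gt;], key: usize) -&gt; Option&lt;&amp;ValuePointer&gt; {
    tracker.get(key)?.as_ref()
}

/// Consecutive blocks a value occupies; each block holds data of one value only.
fn blocks_needed(value_len: usize) -&gt; usize {
    (value_len + BLOCK_SIZE - 1) / BLOCK_SIZE
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;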
&lt;h3&gt;
  
  
  2. The Mask Layer Reuses Space
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Mask Layer&lt;/strong&gt; helps Gridstore handle updates and deletions without the need for expensive data compaction. Instead of maintaining complex metadata for each block, Gridstore tracks usage with a bitmask, where each bit represents a block, with 1 for used, 0 for free.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2hhaftlk0xddax0lcwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2hhaftlk0xddax0lcwu.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;br&gt;
The bitmask efficiently tracks block usage.&lt;/p&gt;

&lt;p&gt;This makes it easy to determine where new values can be written. When a value is removed, it gets soft-deleted at its pointer, and the corresponding blocks in the bitmask are marked as available. Similarly, when updating a value, the new version is written elsewhere, and the old blocks are freed in the bitmask.&lt;/p&gt;

&lt;p&gt;This approach ensures that Gridstore doesn’t waste space. As the storage grows, however, scanning the entire bitmask for available blocks can become computationally expensive.&lt;/p&gt;
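
&lt;p&gt;A minimal version of such a bitmask might look like this (again illustrative, not the actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// One bit per block: 1 = used, 0 = free.
struct BlockMask {
    bits: Vec&lt;u64&gt;,
}

impl BlockMask {
    fn set(&amp;mut self, block: usize, used: bool) {
        let (word, bit) = (block / 64, block % 64);
        if used {
            self.bits[word] |= 1u64 &lt;&lt; bit;
        } else {
            self.bits[word] &amp;= !(1u64 &lt;&lt; bit);
        }
    }

    fn is_used(&amp;self, block: usize) -&gt; bool {
        (self.bits[block / 64] &gt;&gt; (block % 64)) &amp; 1 == 1
    }

    /// Soft deletion: freeing a value is just clearing its blocks' bits.
    fn free_range(&amp;mut self, start: usize, count: usize) {
        for b in start..start + count {
            self.set(b, false);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;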
&lt;h3&gt;
  
  
  3. The Gaps Layer for Effective Updates
&lt;/h3&gt;

&lt;p&gt;To further optimize update handling, Gridstore introduces &lt;strong&gt;The Gaps Layer&lt;/strong&gt;, which provides a higher-level view of block availability. &lt;br&gt;
Instead of scanning the entire bitmask, Gridstore splits the bitmask into regions and keeps track of the largest contiguous free space within each region, known as &lt;strong&gt;The Region Gap&lt;/strong&gt;. By also storing the leading and trailing gaps of each region, the system can efficiently combine multiple regions when needed for storing large values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wski880wmo2ed91syno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wski880wmo2ed91syno.png" alt="Image description" width="800" height="563"&gt;&lt;/a&gt;&lt;br&gt;
The complete architecture of Gridstore&lt;/p&gt;

&lt;p&gt;This layered approach allows Gridstore to locate available space quickly, scaling down the work required for scans while keeping memory overhead minimal. With this system, finding storage space for new values requires scanning only a tiny fraction of the total metadata, making updates and insertions highly efficient, even in large segments.&lt;/p&gt;

&lt;p&gt;Given the default configuration, the gaps layer takes up roughly a millionth of the actual storage size: for every 1GB of data, it only requires scanning 6KB of metadata. With this mechanism, the other operations can be executed in virtually constant-time complexity.&lt;/p&gt;
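
&lt;p&gt;Conceptually, each region summary needs only a few numbers (an illustrative sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Per-region summary kept by the gaps layer. `max` is the largest run of
/// free blocks inside the region; `leading` and `trailing` allow combining
/// adjacent regions when a value needs a gap spanning region boundaries.
struct RegionGaps {
    leading: u16,
    trailing: u16,
    max: u16,
}

/// Find a region that can hold `blocks` on its own, scanning only the
/// small summaries instead of the full bitmask.
fn find_region(gaps: &amp;[RegionGaps], blocks: u16) -&gt; Option&lt;usize&gt; {
    gaps.iter().position(|g| g.max &gt;= blocks)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;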
&lt;h2&gt;
  
  
  Gridstore in Production: Maintaining Data Integrity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahj36tq3wy0484431sdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahj36tq3wy0484431sdx.png" alt="Image description" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gridstore’s architecture introduces multiple interdependent structures that must remain in sync to ensure data integrity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Data Layer&lt;/strong&gt; holds the data and associates each key with its location in storage, including page ID, block offset, and the size of its value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mask Layer&lt;/strong&gt; keeps track of which blocks are occupied and which are free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Gaps Layer&lt;/strong&gt; provides an indexed view of free blocks for efficient space allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time a new value is inserted or an existing value is updated, all these components need to be modified in a coordinated way.&lt;/p&gt;
&lt;h3&gt;
  
  
  When Things Break in Real Life
&lt;/h3&gt;

&lt;p&gt;Real-world systems don’t operate in a vacuum. Failures happen: software bugs cause unexpected crashes, memory exhaustion forces processes to terminate, disks fail to persist data reliably, and power losses can interrupt operations at any moment. &lt;br&gt;
&lt;em&gt;The critical question is: what happens if a failure occurs while updating these structures?&lt;/em&gt;&lt;br&gt;
If one component is updated but another isn’t, the entire system could become inconsistent. Worse, if an operation is only partially written to disk, it could lead to orphaned data, unusable space, or even data corruption.&lt;/p&gt;
&lt;h3&gt;
  
  
  Stability Through Idempotency: Recovering With WAL
&lt;/h3&gt;

&lt;p&gt;To guard against these risks, Qdrant relies on a &lt;a href="https://qdrant.tech/documentation/concepts/storage/"&gt;&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/a&gt;. Before committing an operation, Qdrant ensures that it is at least recorded in the WAL. If a crash happens before all updates are flushed, the system can safely replay operations from the log. &lt;/p&gt;

&lt;p&gt;This recovery mechanism introduces another essential property: &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;&lt;strong&gt;idempotence&lt;/strong&gt;&lt;/a&gt;. &lt;br&gt;
The storage system must be designed so that reapplying the same operation after a failure leads to the same final state as if the operation had been applied just once.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Grand Solution: Lazy Updates
&lt;/h3&gt;

&lt;p&gt;To achieve this, &lt;strong&gt;Gridstore completes updates lazily&lt;/strong&gt;, prioritizing the most critical part of the write: the data itself. &lt;/p&gt;

&lt;p&gt;👉 Instead of immediately updating all metadata structures, it writes the new value first while keeping lightweight pending changes in a buffer. &lt;br&gt;
👉 The system only finalizes these updates when explicitly requested, ensuring that a crash never results in marking data as deleted before the update has been safely persisted. &lt;br&gt;
👉 In the worst-case scenario, Gridstore may need to write the same data twice, leading to a minor space overhead, but it will never corrupt the storage by overwriting valid data. &lt;/p&gt;
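
&lt;p&gt;In pseudo-Rust, the write path splits into a durable part and a deferred part (the &lt;code&gt;Store&lt;/code&gt; trait and the types here are our simplification, not Gridstore’s API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;struct ValuePtr {
    start: usize,
    count: usize,
}

/// Simplified view of the storage operations the write path needs.
trait Store {
    /// Append the value into free blocks and return where it landed.
    fn write_blocks(&amp;mut self, value: &amp;[u8]) -&gt; ValuePtr;
    /// Current pointer for the key, if any.
    fn pointer(&amp;self, key: usize) -&gt; Option&lt;ValuePtr&gt;;
}

enum PendingChange {
    FreeBlocks { start: usize, count: usize },
    SetPointer { key: usize, ptr: ValuePtr },
}

/// Lazy update: persist the value first; defer every metadata change.
/// A crash before the flush can at worst duplicate data on disk; it can
/// never mark live data as deleted.
fn put(store: &amp;mut dyn Store, pending: &amp;mut Vec&lt;PendingChange&gt;, key: usize, value: &amp;[u8]) {
    let new_ptr = store.write_blocks(value); // the critical part, written first
    if let Some(old) = store.pointer(key) {
        // Old blocks are only *marked* for freeing; applied on explicit flush.
        pending.push(PendingChange::FreeBlocks { start: old.start, count: old.count });
    }
    pending.push(PendingChange::SetPointer { key, ptr: new_ptr });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;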
&lt;h2&gt;
  
  
  How We Tested the Final Product
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy8rcpa29k45ja0d5u8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy8rcpa29k45ja0d5u8n.png" alt="Image description" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  First... Model Testing
&lt;/h3&gt;

&lt;p&gt;Gridstore can be tested efficiently using model testing, which compares its behavior to a simple in-memory hash map. Since Gridstore should function like a persisted hash map, this method quickly detects inconsistencies.&lt;/p&gt;

&lt;p&gt;The process is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialize a Gridstore instance and an empty hash map.&lt;/li&gt;
&lt;li&gt;Run random operations (put, delete, update) on both.&lt;/li&gt;
&lt;li&gt;Verify that results match after each operation.&lt;/li&gt;
&lt;li&gt;Compare all keys and values to ensure consistency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach provides high test coverage, exposing issues like incorrect persistence or faulty deletions. Running large-scale model tests ensures Gridstore remains reliable in real-world use.&lt;/p&gt;

&lt;p&gt;Here is a naive way to generate operations in Rust.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Assumes `PointOffset`, `Payload`, and `random_payload` come from the test harness.&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Rng&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_point_offset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;point_offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..=&lt;/span&gt;&lt;span class="n"&gt;max_point_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.gen_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;unreachable!&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
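
&lt;p&gt;A model-test loop built on this generator is equally small. A sketch, where the &lt;code&gt;KvStore&lt;/code&gt; trait is our simplification of the storage interface, and &lt;code&gt;PointOffset&lt;/code&gt;/&lt;code&gt;Payload&lt;/code&gt; are assumed to implement the usual &lt;code&gt;Copy&lt;/code&gt;, &lt;code&gt;Eq&lt;/code&gt;, &lt;code&gt;Hash&lt;/code&gt;, &lt;code&gt;PartialEq&lt;/code&gt;, and &lt;code&gt;Debug&lt;/code&gt; traits:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;
use rand::Rng;

/// Minimal interface the model test needs; Gridstore's real API differs.
trait KvStore {
    fn put(&amp;mut self, key: PointOffset, value: &amp;Payload);
    fn delete(&amp;mut self, key: PointOffset);
    fn get(&amp;self, key: PointOffset) -&gt; Option&lt;Payload&gt;;
}

fn model_test(store: &amp;mut impl KvStore, rng: &amp;mut impl Rng, steps: u32) {
    let mut model: HashMap&lt;PointOffset, Payload&gt; = HashMap::new();
    for _ in 0..steps {
        // Apply the same random operation to the store and the reference model.
        match Operation::random(rng, 1_000) {
            Operation::Put(k, v) | Operation::Update(k, v) =&gt; {
                store.put(k, &amp;v);
                model.insert(k, v);
            }
            Operation::Delete(k) =&gt; {
                store.delete(k);
                model.remove(&amp;k);
            }
        }
    }
    // Compare all keys and values (the steps above verify after each
    // operation; we check once at the end for brevity).
    for (k, v) in &amp;model {
        assert_eq!(store.get(*k).as_ref(), Some(v));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;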



&lt;p&gt;Model testing is a high-value way to catch bugs, especially when your system mimics a well-defined component like a hash map. If your component behaves the same as another one, using model testing brings a lot of value for a bit of effort.&lt;/p&gt;

&lt;p&gt;We could have tested against RocksDB, but simplicity matters more. A simple hash map lets us run massive test sequences quickly, exposing issues faster.&lt;/p&gt;

&lt;p&gt;For even sharper debugging, Property-Based Testing adds automated test generation and shrinking. It pinpoints failures with minimized test cases, making bug hunting faster and more effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash Testing: Can Gridstore Handle the Pressure?
&lt;/h3&gt;

&lt;p&gt;Designing for crash resilience is one thing, and proving it works under stress is another. To push Qdrant’s data integrity to the limit, we built &lt;a href="https://github.com/qdrant/crasher" rel="noopener noreferrer"&gt;&lt;strong&gt;Crasher&lt;/strong&gt;&lt;/a&gt;, a test bench that brutally kills and restarts Qdrant while it handles a heavy update workload.&lt;/p&gt;

&lt;p&gt;Crasher runs a loop that continuously writes data, then randomly crashes Qdrant. On each restart, Qdrant replays its &lt;a href="https://qdrant.tech/documentation/concepts/storage/"&gt;&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/a&gt;, and we verify whether data integrity holds. Possible anomalies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing data (points, vectors, or payloads)&lt;/li&gt;
&lt;li&gt;Corrupt payload values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aggressive yet simple approach has uncovered real-world issues when run for extended periods. While we also use chaos testing for distributed setups, Crasher excels at fast, repeatable failure testing in a local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Gridstore Performance: Benchmarks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83z3krx9wuz8xosjmctg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83z3krx9wuz8xosjmctg.png" alt="Image description" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To measure the impact of our new storage engine, we used &lt;a href="https://github.com/jonhoo/bustle" rel="noopener noreferrer"&gt;&lt;strong&gt;Bustle, a key-value storage benchmarking framework&lt;/strong&gt;&lt;/a&gt;, to compare Gridstore against RocksDB. We tested three workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Type&lt;/th&gt;
&lt;th&gt;Operation Distribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95% reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Insert-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80% inserts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50% updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  The results speak for themselves:
&lt;/h4&gt;

&lt;p&gt;Average latency for all kinds of workloads is lower across the board, particularly for inserts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ww9g05p05zru1jaqb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ww9g05p05zru1jaqb5.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This shows a clear boost in performance. As we can see, the investment in Gridstore is paying off.&lt;/p&gt;

&lt;h3&gt;
  
  
  End-to-End Benchmarking
&lt;/h3&gt;

&lt;p&gt;Now, let’s test the impact on a real Qdrant instance. So far, we’ve only integrated Gridstore for &lt;a href="https://qdrant.tech/documentation/concepts/payload/"&gt;&lt;strong&gt;payloads&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://qdrant.tech/documentation/concepts/vectors/#sparse-vectors"&gt;&lt;strong&gt;sparse vectors&lt;/strong&gt;&lt;/a&gt;, but even this partial switch should show noticeable improvements.&lt;/p&gt;

&lt;p&gt;For benchmarking, we used our in-house &lt;a href="https://github.com/qdrant/bfb" rel="noopener noreferrer"&gt;&lt;strong&gt;bfb tool&lt;/strong&gt;&lt;/a&gt; to generate a workload. Our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;bfb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--max-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--sparse-vectors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--set-payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--on-disk-payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--dim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--sparse-dim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--bool-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--keywords&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--float-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--int-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--text-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--text-payload-length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--skip-field-indices&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--jsonl-updates&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;./rps.jsonl&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This benchmark upserts 1 million points twice. Each point has: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A medium to large payload&lt;/li&gt;
&lt;li&gt;A tiny dense vector (dense vectors use a different storage type)&lt;/li&gt;
&lt;li&gt;A sparse vector&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Additional configuration:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;The test updated payload data separately, in another request.&lt;/li&gt;
&lt;li&gt;There were no payload indices, which ensured we measured pure ingestion speed.&lt;/li&gt;
&lt;li&gt;Finally, we gathered request latency metrics for analysis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We ran this against Qdrant 1.12.6, toggling between the old and new storage backends. &lt;/p&gt;

&lt;h3&gt;
  
  
  Final Result
&lt;/h3&gt;

&lt;p&gt;Data ingestion is &lt;strong&gt;twice as fast, with smoother throughput&lt;/strong&gt; — a massive win! 😍&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxb9k7bspj5ne9895c7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxb9k7bspj5ne9895c7g.png" alt="Image description" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We optimized for speed, and it paid off—but what about storage size?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gridstore: 2333MB&lt;/li&gt;
&lt;li&gt;RocksDB: 2319MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strictly speaking, RocksDB is slightly smaller, but the difference is negligible compared to the 2x faster ingestion and more stable throughput. A small trade-off for a big performance gain!&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying Out Gridstore
&lt;/h2&gt;

&lt;p&gt;Gridstore represents a significant advancement in how Qdrant manages its &lt;strong&gt;key-value storage&lt;/strong&gt; needs. It offers great performance and streamlined updates tailored specifically for our use case. We have managed to achieve faster, more reliable data ingestion while maintaining data integrity, even under heavy workloads and unexpected failures. It is already used as a storage backend for on-disk payloads and sparse vectors.&lt;/p&gt;

&lt;p&gt;👉 It’s important to note that Gridstore remains tightly integrated with Qdrant and, as such, has not been released as a standalone crate. &lt;br&gt;
Its API is still evolving, and we are focused on refining it within our ecosystem to ensure maximum stability and performance. That said, we recognize the value this innovation could bring to the wider Rust community. In the future, once the API stabilizes and we decouple it enough from Qdrant, we will consider publishing it as a contribution to the community ❤️.&lt;/p&gt;

&lt;p&gt;For now, Gridstore continues to drive improvements in Qdrant, demonstrating the benefits of a custom-tailored storage engine designed with modern demands in mind. Stay tuned for further updates and potential community releases as we keep pushing the boundaries of performance and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwbbltk1kbymxiqexsku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwbbltk1kbymxiqexsku.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;br&gt;
Simple, efficient, and designed just for Qdrant.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>What is Agentic RAG? Building Agents with Qdrant</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Mon, 25 Nov 2024 18:19:49 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-agentic-rag-building-agents-with-qdrant-2pm0</link>
      <guid>https://dev.to/qdrant/what-is-agentic-rag-building-agents-with-qdrant-2pm0</guid>
      <description>&lt;p&gt;Standard &lt;a href="https://dev.to/articles/what-is-rag-in-ai/"&gt;Retrieval Augmented Generation&lt;/a&gt; follows a predictable, linear path: receive a query, retrieve relevant documents, and generate a response. In many cases that might be enough to solve a particular problem. In the worst case scenario, your LLM will just decide to not answer the question, because the context does not provide enough information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryfoc1hqtjy5u2lkplot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryfoc1hqtjy5u2lkplot.png" alt="Standard, linear RAG pipeline" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, we have agents. These systems are given more freedom to act, and can take multiple non-linear steps to achieve a certain goal. There isn't a single definition of what an agent is, but in general, it is an application that uses an LLM and usually some tools to communicate with the outside world. &lt;/p&gt;

&lt;p&gt;LLMs are used as decision-makers which decide what action to take next. Actions can be anything, but they are usually well-defined and limited to a certain set of possibilities. One of these actions might be to query a vector database, like Qdrant, to retrieve relevant documents, if the context is not enough to make a decision. &lt;/p&gt;

&lt;p&gt;However, RAG is just a single tool in the agent's arsenal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1acyiwpo6m8csvlajuy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1acyiwpo6m8csvlajuy1.png" alt="AI Agent" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic RAG: Combining RAG with Agents
&lt;/h2&gt;

&lt;p&gt;Since the agent definition is vague, the concept of &lt;strong&gt;Agentic RAG&lt;/strong&gt; is also not well-defined. In general, it refers to the combination of RAG with agents. This allows the agent to use external knowledge sources to make decisions, and primarily to decide when the external knowledge is needed. &lt;/p&gt;

&lt;p&gt;We can describe a system as Agentic RAG if it breaks the linear flow of a standard RAG system, and gives the agent the ability to take multiple steps to achieve a goal.&lt;/p&gt;

&lt;p&gt;A simple router that chooses a path to follow is often described as the simplest form of an agent. Such a system has multiple paths with conditions describing when to take a certain path. In the context of Agentic RAG, the agent can decide to query a vector database if the context is not enough to answer, or skip the query if the context suffices or the question refers to common knowledge. &lt;/p&gt;

&lt;p&gt;Alternatively, there might be multiple collections storing different kinds of information, and the agent can decide which collection to query based on the context. The key factor is that the decision of choosing a path is made by the LLM, which is the core of the agent. A routing agent never comes back to the previous step, so it's ultimately just a conditional decision-making system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5b7o9qmvo1a8wy6rvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5b7o9qmvo1a8wy6rvh.png" alt="Routing Agent" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;
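&lt;p&gt;To make the routing idea concrete, here is a minimal, framework-free sketch. The helpers (&lt;code&gt;llm_choose&lt;/code&gt;, &lt;code&gt;search_qdrant&lt;/code&gt;, &lt;code&gt;generate_answer&lt;/code&gt;) are hypothetical placeholders, not part of any specific library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def llm_choose(question: str, options: list[str]) -&amp;gt; str:
    """Hypothetical: ask an LLM to pick exactly one of the given labels."""
    ...

def search_qdrant(question: str) -&amp;gt; list[str]:
    """Hypothetical: run a semantic search and return matching documents."""
    ...

def generate_answer(question: str, context: list[str] | None) -&amp;gt; str:
    """Hypothetical: ask an LLM to answer, optionally grounded in context."""
    ...

def route(question: str) -&amp;gt; str:
    # The LLM only picks a path; a routing agent never revisits earlier steps.
    decision = llm_choose(question, options=["vector_search", "common_knowledge"])
    if decision == "vector_search":
        return generate_answer(question, context=search_qdrant(question))
    return generate_answer(question, context=None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;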

&lt;p&gt;However, routing is just the beginning. Agents can be much more complex, and extreme forms of agents can have complete freedom to act. In such cases, the agent is given a set of tools and can autonomously decide which ones to use, how to use them, and in which order. LLMs are asked to plan and execute actions, and the agent can take multiple steps to achieve a goal, including taking steps back if needed. Such a system does not have to follow a DAG (Directed Acyclic Graph) structure, and can have loops that help to self-correct the decisions made in the past. &lt;/p&gt;

&lt;p&gt;An agentic RAG system built in that manner can have tools not only to query a vector database, but also to play with the query, summarize the results, or even generate new data to answer the question. Options are endless, but there are some common patterns that can be observed in the wild. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie282zphjgn5nffkgkyn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie282zphjgn5nffkgkyn.png" alt="Autonomous Agent" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solving Information Retrieval Problems with LLMs
&lt;/h3&gt;

&lt;p&gt;Generally speaking, tools exposed in an agentic RAG system are used to solve information retrieval problems which are not new to the search community. LLMs have changed how we approach these problems, but the core of the problem remains the same. What kinds of tools can you consider using in an agentic RAG system? Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Querying a vector database&lt;/strong&gt; - the most common tool used in agentic RAG systems. It allows the agent to retrieve relevant documents based on the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query expansion&lt;/strong&gt; - a tool for improving the query. It can add synonyms, correct typos, or even generate new queries based on the original one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s73cptbqh6xbol73lod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s73cptbqh6xbol73lod.png" alt="Query expansion example" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extracting filters&lt;/strong&gt; - vector search alone is sometimes not enough. In many cases, you might want to narrow down the results based on specific parameters. This extraction process can automatically identify relevant conditions from the query. Otherwise, your users would have to manually define these search constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyonl9p8qivfuqwtg3bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyonl9p8qivfuqwtg3bx.png" alt="Extracting filters" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;
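&lt;p&gt;Here is a minimal sketch of that extraction step. The &lt;code&gt;llm_extract&lt;/code&gt; and &lt;code&gt;embed&lt;/code&gt; helpers, the collection, and the field names are hypothetical; the filter construction itself uses the real &lt;code&gt;qdrant-client&lt;/code&gt; models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def llm_extract(query: str) -&amp;gt; dict:
    """Hypothetical: ask an LLM to return e.g. {"color": "red", "max_price": 100.0}."""
    ...

def embed(query: str) -&amp;gt; list[float]:
    """Hypothetical: embed the query with the same model used at indexing time."""
    ...

def search_with_filters(query: str):
    constraints = llm_extract(query)
    # Map the extracted constraints onto a Qdrant payload filter.
    query_filter = models.Filter(must=[
        models.FieldCondition(key="color", match=models.MatchValue(value=constraints["color"])),
        models.FieldCondition(key="price", range=models.Range(lt=constraints["max_price"])),
    ])
    return client.query_points(
        collection_name="products",  # assumed collection
        query=embed(query),
        query_filter=query_filter,
    ).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;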

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality judgement&lt;/strong&gt; - knowing the quality of the results for a given query can be used to decide whether they are good enough to answer, or whether the agent should take another step to improve them somehow. Alternatively, the agent can admit it failed to provide a good response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpzgubn4ecxd66uvn81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpzgubn4ecxd66uvn81.png" alt="Quality judgement" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;
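&lt;p&gt;A quality-judgement step can be as simple as an LLM grading the results, with a threshold deciding what happens next. A minimal sketch, with a hypothetical &lt;code&gt;llm_judge&lt;/code&gt; helper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def llm_judge(query: str, documents: list[str]) -&amp;gt; float:
    """Hypothetical: ask an LLM to grade the results' relevance from 0.0 to 1.0."""
    ...

def good_enough(query: str, documents: list[str], threshold: float = 0.7) -&amp;gt; bool:
    # Below the threshold, the agent should take another step (e.g. reformulate
    # the query) or admit it cannot produce a good answer.
    return llm_judge(query, documents) &amp;gt;= threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;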

&lt;p&gt;These are just some examples; the list is not exhaustive. Your LLM could also play with Qdrant search parameters or choose different methods to query it. For instance, if your users search with specific keywords, you may prefer sparse vectors to dense vectors, as they are more efficient in such cases. You then have to arm your agent with tools to decide when to use sparse vectors and when to use dense vectors. An agent aware of the collection structure can make such decisions easily.&lt;/p&gt;
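&lt;p&gt;As a sketch, assuming a collection named &lt;code&gt;docs&lt;/code&gt; with named &lt;code&gt;dense&lt;/code&gt; and &lt;code&gt;sparse&lt;/code&gt; vectors, the two retrieval strategies could be exposed as separate tools for the agent to choose between:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def search_dense(query_vector: list[float]) -&amp;gt; list:
    """Semantic search - a good default for natural-language queries."""
    return client.query_points("docs", query=query_vector, using="dense").points

def search_sparse(indices: list[int], values: list[float]) -&amp;gt; list:
    """Sparse-vector search - often better for keyword-heavy queries."""
    return client.query_points(
        "docs",
        query=models.SparseVector(indices=indices, values=values),
        using="sparse",
    ).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;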

&lt;p&gt;Each of these tools might be a separate agent on its own, and multi-agent systems are not uncommon. In such cases, agents can communicate with each other, and one agent can decide to use another agent to solve a particular problem.&lt;/p&gt;

&lt;p&gt;A human in the loop is also a pretty useful component of an agentic RAG system: it can correct the agent's decisions, or steer it in the right direction.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where are Agents Used?
&lt;/h2&gt;

&lt;p&gt;Agents are an interesting concept, but since they heavily rely on LLMs, they are not applicable to all problems. Large Language Models are expensive and tend to be slow, so in many cases the cost is not worth it. Standard RAG involves just a single call to the LLM, and the response is generated in a predictable way. Agents, on the other hand, can take multiple steps, and the latency experienced by the user adds up. &lt;/p&gt;

&lt;p&gt;In many cases, it's not acceptable. Agentic RAG is probably not that widely applicable in ecommerce search, where the user expects a quick response, but might be fine for customer support, where the user is willing to wait a bit longer for a better answer.&lt;/p&gt;
&lt;h2&gt;
  
  
  Which Framework is Best?
&lt;/h2&gt;

&lt;p&gt;There are lots of frameworks available to build agents, and choosing the best one is not easy. It depends on your existing stack or the tools you are familiar with. Some of the most popular LLM libraries have already drifted towards the agent paradigm, and they are offering tools to build them. There are, however, some tools built primarily for agent development, so let's focus on them.&lt;/p&gt;
&lt;h3&gt;
  
  
  LangGraph
&lt;/h3&gt;

&lt;p&gt;Developed by the LangChain team, LangGraph seems like a natural extension for those who already use LangChain for building their RAG systems, and would like to start with agentic RAG. &lt;/p&gt;

&lt;p&gt;Surprisingly, LangGraph has nothing to do with Large Language Models on its own. It's a framework for building graph-based applications in which each &lt;strong&gt;node&lt;/strong&gt; is a step of the workflow. Each node takes an application &lt;strong&gt;state&lt;/strong&gt; as an input, and produces a modified state as an output. The state is then passed to the next node, and so on. &lt;strong&gt;Edges&lt;/strong&gt; between the nodes might be conditional, which makes branching possible. Contrary to DAG-based tools (e.g. Apache Airflow), LangGraph allows for loops in the graph, which makes it possible to implement cyclic workflows, so an agent can achieve self-reflection and self-correction. Theoretically, LangGraph can be used to build any kind of application in a graph-based manner, not only LLM agents.&lt;/p&gt;

&lt;p&gt;Some of the strengths of LangGraph include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; - the state of the workflow graph is stored as a checkpoint. That happens at each so-called super-step (which is a single sequential node of a graph). It enables replaying certain steps of the workflow, fault tolerance, and human-in-the-loop interactions. This mechanism also acts as a &lt;strong&gt;short-term memory&lt;/strong&gt;, accessible in the context of a particular workflow execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; - LangGraph also has a concept of memories that are shared between different workflow runs. However, this mechanism has to be explicitly handled by our nodes. &lt;strong&gt;Qdrant with its semantic search capabilities is often used as a long-term memory layer&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent support&lt;/strong&gt; - while there is no separate concept of multi-agent systems in LangGraph, it's possible to create such an architecture by building a graph that includes multiple agents and some kind of supervisor that decides which agent to use in a given situation. Since a node might be anything, it might be another agent as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some other interesting features of LangGraph include the ability to visualize the graph, automate the retries of failed steps, and include human-in-the-loop interactions.&lt;/p&gt;

&lt;p&gt;A minimal example of an agentic RAG could improve the user query, e.g. by fixing typos, expanding it with synonyms, or even generating a new query based on the original one. The agent could then retrieve documents from a vector database based on the improved query, and generate a response. The LangGraph app implementing this approach could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing_extensions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.constants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The state of the agent includes at least the messages exchanged between the agent(s) 
&lt;/span&gt;    &lt;span class="c1"&gt;# and the user. It is, however, possible to include other information in the state, as 
&lt;/span&gt;    &lt;span class="c1"&gt;# it depends on the specific agent.
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;improve_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# Building a graph requires defining nodes and building the flow between them with edges.
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;improve_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improve_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compiling the graph performs some checks and prepares the graph for execution.
&lt;/span&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Compiled graph might be invoked with the initial state to start.
&lt;/span&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node of the process is just a Python function that performs a certain operation. You can call an LLM of your choice inside them, if you want to, but there is no assumption about the messages being created by any AI. &lt;strong&gt;LangGraph rather acts as a runtime that launches these functions in a specific order, and passes the state between them&lt;/strong&gt;. &lt;/p&gt;
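&lt;p&gt;For instance, the &lt;code&gt;improve_query&lt;/code&gt; node left as a stub above could be filled in roughly like this, reusing the &lt;code&gt;AgentState&lt;/code&gt; from the example. It's only a sketch; the &lt;code&gt;langchain-openai&lt;/code&gt; package and the model name are assumptions, and any LLM client would do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # assumed model; any chat model works

def improve_query(state: AgentState):
    # Ask the LLM to rewrite the latest user message into a better search query.
    last_message = state["messages"][-1]
    rewritten = llm.invoke([
        HumanMessage(
            content=f"Rewrite this search query, fixing typos and adding synonyms: {last_message.content}"
        )
    ])
    # The add_messages reducer appends the rewritten query to the state.
    return {"messages": [rewritten]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;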

&lt;p&gt;While &lt;a href="https://www.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; integrates well with the LangChain ecosystem, it can be used independently. For teams looking for additional support and features, there's also a commercial offering called LangGraph Platform. The framework is available for both Python and JavaScript environments, making it possible to be used in different tech stacks.&lt;/p&gt;
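&lt;p&gt;One more sketch before moving on: the persistence mechanism described above is enabled when the graph is compiled. Assuming LangGraph's in-memory &lt;code&gt;MemorySaver&lt;/code&gt; checkpointer (a database-backed saver would replace it in production) and the &lt;code&gt;builder&lt;/code&gt; from the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langgraph.checkpoint.memory import MemorySaver

compiled_graph = builder.compile(checkpointer=MemorySaver())

# The thread_id scopes the short-term memory: invoking the graph again with
# the same id resumes from the checkpoints stored for that conversation.
compiled_graph.invoke(
    {"messages": [("user", "Why Qdrant is the best vector database out there?")]},
    config={"configurable": {"thread_id": "conversation-1"}},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;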

&lt;h3&gt;
  
  
  CrewAI
&lt;/h3&gt;

&lt;p&gt;CrewAI is another popular choice for building agents, including agentic RAG. It's a high-level framework that assumes there are some LLM-based agents working together to achieve a common goal. That's where the "crew" in CrewAI comes from. CrewAI is designed with multi-agent systems in mind. Contrary to LangGraph, the developer does not create a graph of processing, but defines agents and their roles within the crew.&lt;/p&gt;

&lt;p&gt;Some of the key concepts of CrewAI include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; - a unit that has a specific role and goal, controlled by an LLM. It can optionally use some external tools to communicate with the outside world, but it is generally steered by the prompt we provide to the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process&lt;/strong&gt; - currently either sequential or hierarchical. It defines how the task will be executed by the agents. In a sequential process, agents are executed one after another, while in a hierarchical process, an agent is selected by a manager agent, which is responsible for deciding which agent to use in a given situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roles and goals&lt;/strong&gt; - each agent has a certain role within the crew, and the goal it should aim to achieve. These are set when we define an agent and are used to make decisions about which agent to use in a given situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; - an extensive memory system consisting of short-term memory, long-term memory, entity memory, and contextual memory that combines the other three. There is also user memory for preferences and personalization. &lt;strong&gt;This is where Qdrant comes into play, as it might be used as a long-term memory layer.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrewAI provides a rich set of tools integrated into the framework. That may be a huge advantage for those who want to combine RAG with e.g. code execution or image generation. The ecosystem is rich; however, bringing your own tools is not a big deal, as CrewAI is designed to be extensible.&lt;/p&gt;

&lt;p&gt;A simple agentic RAG application implemented in CrewAI could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.entity.entity_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EntityMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.short_term.short_term_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ShortTermMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.memory.storage.rag_storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGStorage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAGStorage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;response_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate response based on the conversation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide the best response, or admit when the response is not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a response generator agent. I generate &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;responses based on the conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query_reformulation_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reformulate the query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rewrite the query to get better results. Fix typos, grammar, word choice, etc.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a query reformulation agent. I reformulate the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query to get better results.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Let me know why Qdrant is the best vector database out there.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 bullet points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_reformulation_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entity_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EntityMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;short_term_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ShortTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QdrantStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short-term&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Disclaimer: QdrantStorage is not a part of the CrewAI framework, but it's taken from the Qdrant documentation on &lt;a href="https://qdrant.tech/documentation/frameworks/crewai/" rel="noopener noreferrer"&gt;how to integrate Qdrant with CrewAI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Although it's not a technical advantage, CrewAI has &lt;a href="https://docs.crewai.com/introduction" rel="noopener noreferrer"&gt;great documentation&lt;/a&gt;. The framework is available for Python, and it's easy to get started with it. CrewAI also has a commercial offering, CrewAI Enterprise, which provides a platform for building and deploying agents at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoGen
&lt;/h3&gt;

&lt;p&gt;AutoGen emphasizes multi-agent architectures as a fundamental design principle. The framework requires at least two agents in any system to really call an application agentic - typically an assistant and a user proxy exchange messages to achieve a common goal. Sequential chat with more than two agents is also supported, as well as group chat and nested chat for internal dialogue. However, AutoGen does not assume there is a structured state that is passed between the agents, and the chat conversation is the only way to communicate between them.&lt;/p&gt;

&lt;p&gt;There are many interesting concepts in the framework, some of them even quite unique:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools/functions&lt;/strong&gt; - external components that can be used by agents to communicate with the outside world. They are defined as Python callables, and can be used for any external interaction we want to allow the agent to perform. Type annotations define the input and output of the tools, and Pydantic models are supported for more complex type schemas. For the time being, AutoGen supports only the OpenAI-compatible tool-call API (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code executors&lt;/strong&gt; - built-in code executors include local command, Docker command, and Jupyter. An agent can write and launch code, so theoretically the agents can do anything that can be done in Python. None of the other frameworks makes code generation and execution that prominent. Code execution being a first-class citizen in AutoGen is an interesting concept.&lt;/li&gt;
&lt;/ul&gt;
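&lt;p&gt;As an illustration of the tools mechanism, here is a minimal sketch of registering a typed retrieval function with a pair of agents. It assumes the pyautogen 0.2.x API; &lt;code&gt;retrieve_documents&lt;/code&gt; is a hypothetical helper left as a stub:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from os import environ

from typing_extensions import Annotated
from autogen import ConversableAgent, register_function

llm_config = {"config_list": [{"model": "gpt-4", "api_key": environ.get("OPENAI_API_KEY")}]}

assistant = ConversableAgent(name="assistant", llm_config=llm_config)
user_proxy = ConversableAgent(name="user_proxy", llm_config=False, human_input_mode="NEVER")

def retrieve_documents(
    query: Annotated[str, "The search query to run against the knowledge base"],
) -&amp;gt; list[str]:
    """Hypothetical: perform a Qdrant search and return the matching chunks."""
    ...

# The caller proposes tool calls; the executor actually runs the Python function.
register_function(
    retrieve_documents,
    caller=assistant,
    executor=user_proxy,
    description="Retrieve documents relevant to the query.",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;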

&lt;p&gt;Each AutoGen agent uses at least one of the components: human-in-the-loop, code executor, tool executor, or LLM. A simple agentic RAG, based on a conversation between two agents that can retrieve documents from a vector database or improve the query, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversableAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen.agentchat.contrib.retrieve_user_proxy_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrieveUserProxyAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="n"&gt;response_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_generator_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You answer user questions based solely on the provided context. You ask to retrieve relevant documents for &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your query, or reformulate the query, if it is incorrect in some way.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A response generator agent that can answer your queries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}]},&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user_proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RetrieveUserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}]},&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieve_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_token_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_or_create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initiate_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those new to agent development, AutoGen offers AutoGen Studio, a low-code interface for prototyping agents. While not intended for production use, it significantly lowers the barrier to entry for experimenting with agent architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl4f0daj2zg8tqpcy5rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl4f0daj2zg8tqpcy5rd.png" alt="AutoGen Studio" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that AutoGen is currently undergoing significant updates, with version 0.4.x in development introducing substantial API changes compared to the stable 0.2.x release. While the framework currently has limited built-in persistence and state management capabilities, these features may evolve in future releases.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Swarm
&lt;/h3&gt;

&lt;p&gt;Unlike the other frameworks described in this article, OpenAI Swarm is an educational project, and it's not ready for production use. It's worth mentioning, though, as it's pretty lightweight and easy to get started with. OpenAI Swarm is an experimental framework for orchestrating multi-agent workflows that focuses on agent coordination through direct handoffs rather than complex orchestration patterns.&lt;/p&gt;

&lt;p&gt;With that setup, &lt;strong&gt;agents&lt;/strong&gt; are just exchanging messages in a chat, optionally calling some Python functions to communicate with external services, or handing off the conversation to another agent, if the other one seems to be more suitable to answer the question. Each agent has a certain role, defined by the instructions we provide.&lt;/p&gt;

&lt;p&gt;We have to decide which LLM a particular agent will use, and a set of functions it can call. For example, &lt;strong&gt;a retrieval agent could use a vector database to retrieve documents&lt;/strong&gt;, and return the results to the next agent. That means there should be a function that performs the semantic search on its behalf, but the model will decide what the query should look like.&lt;/p&gt;

&lt;p&gt;Here is how a similar agentic RAG application, implemented in OpenAI Swarm, could look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;swarm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Swarm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Swarm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve documents based on the query.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer_to_query_improve_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query_improve_agent&lt;/span&gt;

&lt;span class="n"&gt;query_improve_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query Improve Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a search expert that takes user queries and improves them to get better results. You fix typos and &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extend queries with synonyms, if needed. You never ask the user for more information.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response_generation_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Generation Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You take the whole conversation and generate a final response based on the chat history. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have enough information, you can retrieve the documents from the knowledge base or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reformulate the query by transferring to other agent. You never ask the user for more information. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have to always be the last participant of each conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;functions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieve_documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfer_to_query_improve_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_generation_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why Qdrant is the best vector database out there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though we don't explicitly define the graph of processing, the agents can still decide to hand off the processing to a different agent. There is no concept of a state, so everything relies on the messages exchanged between different components. &lt;/p&gt;

&lt;p&gt;OpenAI Swarm does not focus on integration with external tools, and &lt;strong&gt;if you would like to integrate semantic search with Qdrant, you would have to implement it fully yourself&lt;/strong&gt;. The library is also tightly coupled with OpenAI models; using other ones is possible, but it requires additional work, such as setting up a proxy that adjusts the interface to the OpenAI API.&lt;/p&gt;
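
&lt;p&gt;In practice, many inference servers and proxies already expose an OpenAI-compatible REST interface, so it is often enough to point the OpenAI client at a different base URL. Here is a minimal sketch, assuming the &lt;code&gt;swarm&lt;/code&gt; package and a hypothetical local server exposing an OpenAI-compatible API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI
from swarm import Swarm

# Hypothetical OpenAI-compatible endpoint (e.g., a local inference server or proxy)
custom_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local-servers",
)

# Swarm accepts a pre-configured OpenAI client
client = Swarm(client=custom_client)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;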

&lt;h3&gt;
  
  
  The winner?
&lt;/h3&gt;

&lt;p&gt;Choosing the best framework for your agentic RAG system depends on your existing stack, team expertise, and the specific requirements of your project. All the described tools are strong contenders developed at a rapid pace. It's worth keeping an eye on all of them, as they are likely to evolve and improve over time. Eventually, you should be able to build the same processes with any of them, but some may be a better fit for the specific ecosystem of tools you want your agent to interact with.&lt;/p&gt;

&lt;p&gt;There are, however, some important factors to consider when choosing a framework for your agentic RAG system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop&lt;/strong&gt; - even though we aim to build autonomous agents, it's often important to include feedback from a human, so our agents cannot perform harmful actions unchecked. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - how easy it is to debug the system and understand what's happening inside. This is especially important since we are dealing with lots of LLM prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, choosing the right toolkit depends on the state of your project and your specific requirements. If you want to integrate your agent with a number of external tools, CrewAI might be the best choice, as it offers the largest set of out-of-the-box integrations. However, LangGraph integrates well with LangChain, so if you are familiar with that ecosystem, it may suit you better. &lt;/p&gt;

&lt;p&gt;All the frameworks have different approaches to building agents, so it's worth experimenting with all of them to see which one fits your needs the best. LangGraph and CrewAI are more mature and have more features, while AutoGen and OpenAI Swarm are more lightweight and more experimental. However, &lt;strong&gt;none of the existing frameworks solves all the mentioned Information Retrieval problems&lt;/strong&gt;, so you still have to build your own tools to fill the gaps. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building Agentic RAG with Qdrant
&lt;/h2&gt;

&lt;p&gt;No matter which framework you choose, Qdrant is a great tool to build agentic RAG systems. Please check out &lt;a href="https://dev.to/documentation/frameworks/"&gt;our integrations&lt;/a&gt; to choose the best one for your use case and preferences. The easiest way to start using Qdrant is our managed service, &lt;a href="https://cloud.qdrant.io" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt;. A 1GB cluster is available free of charge, so you can start building your agentic RAG system in minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;p&gt;See how Qdrant integrates with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/autogen/" rel="noopener noreferrer"&gt;Autogen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/crewai/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/langgraph/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/swarm/" rel="noopener noreferrer"&gt;Swarm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qdrant</category>
      <category>rag</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What is Vector Quantization?</title>
      <dc:creator>Sabrina</dc:creator>
      <pubDate>Fri, 27 Sep 2024 16:14:21 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-vector-quantization-nna</link>
      <guid>https://dev.to/qdrant/what-is-vector-quantization-nna</guid>
      <description>&lt;p&gt;Vector quantization is a data compression technique used to reduce the size of high-dimensional data. Compressing vectors reduces memory usage while maintaining nearly all of the essential information. This method allows for more efficient storage and faster search operations, particularly in large datasets.&lt;/p&gt;

&lt;p&gt;When working with high-dimensional vectors, such as embeddings from providers like OpenAI, a single 1536-dimensional vector requires &lt;strong&gt;6 KB of memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyngfp8rta3lrolbsq7um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyngfp8rta3lrolbsq7um.png" alt="1536 dimentional vector of 6KB of size" width="742" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With 1 million vectors needing around 6 GB of memory, as your dataset grows to multiple &lt;strong&gt;millions of vectors&lt;/strong&gt;, the memory and processing demands increase significantly.&lt;/p&gt;

&lt;p&gt;To understand why this process is so computationally demanding, let's take a look at the nature of the &lt;a href="https://qdrant.tech/documentation/concepts/indexing/#vector-index" rel="noopener noreferrer"&gt;HNSW index&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;HNSW (Hierarchical Navigable Small World) index&lt;/strong&gt; organizes vectors in a layered graph, connecting each vector to its nearest neighbors. At each layer, the algorithm narrows down the search area until it reaches the lower layers, where it efficiently finds the closest matches to the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxme94wiq18xae15xqup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxme94wiq18xae15xqup.png" alt="HNSW representation" width="723" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each time a new vector is added, the system must determine its position in the existing graph, a process similar to searching. This makes both inserting and searching for vectors complex operations.&lt;/p&gt;

&lt;p&gt;One of the key challenges with the HNSW index is that it requires a lot of &lt;strong&gt;random reads&lt;/strong&gt; and &lt;strong&gt;sequential traversals&lt;/strong&gt; through the graph. This makes the process computationally expensive, especially when you're dealing with millions of high-dimensional vectors.&lt;/p&gt;

&lt;p&gt;The system has to jump between various points in the graph in an unpredictable manner. This unpredictability makes optimization difficult, and as the dataset grows, the memory and processing requirements increase significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvu4zkyo9st9j55s2j59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvu4zkyo9st9j55s2j59.png" alt="HNSW random searches example" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since vectors need to be stored in &lt;strong&gt;fast storage&lt;/strong&gt; like &lt;strong&gt;RAM&lt;/strong&gt; or &lt;strong&gt;SSD&lt;/strong&gt; for low-latency searches, as the size of the data grows, so does the cost of storing and processing it efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization&lt;/strong&gt; offers a solution by compressing vectors to smaller memory sizes, making the process more efficient.&lt;/p&gt;

&lt;p&gt;There are several methods to achieve this, and here we will focus on three main ones:&lt;/p&gt;

&lt;p&gt;&lt;a href="/articles_data/what-is-vector-quantization/types-of-quant.png" class="article-body-image-wrapper"&gt;&lt;img src="/articles_data/what-is-vector-quantization/types-of-quant.png" alt="Types of Quantization: 1. Scalar Quantization, 2. Product Quantization, 3. Binary Quantization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What is Scalar Quantization?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflk1e2v73ih1jcrdssqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflk1e2v73ih1jcrdssqb.png" alt="Astronaut holding cube" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Qdrant, each dimension is represented by a &lt;code&gt;float32&lt;/code&gt; value, which uses &lt;strong&gt;4 bytes&lt;/strong&gt; of memory. When using &lt;a href="https://qdrant.tech/documentation/guides/quantization/#scalar-quantization" rel="noopener noreferrer"&gt;Scalar Quantization&lt;/a&gt;, we map our vectors to a range that the smaller &lt;code&gt;int8&lt;/code&gt; type can represent. An &lt;code&gt;int8&lt;/code&gt; is only &lt;strong&gt;1 byte&lt;/strong&gt; and can represent 256 values (from -128 to 127, or 0 to 255). This results in a &lt;strong&gt;75% reduction&lt;/strong&gt; in memory size.&lt;/p&gt;

&lt;p&gt;For example, if our data lies in the range of -1.0 to 1.0, Scalar Quantization will transform these values to a range that &lt;code&gt;int8&lt;/code&gt; can represent, i.e., within -128 to 127. The system &lt;strong&gt;maps&lt;/strong&gt; the &lt;code&gt;float32&lt;/code&gt; values into this range.&lt;/p&gt;

&lt;p&gt;Here's a simple linear example of what this process looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdax8eqlo5jw8sgam7qli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdax8eqlo5jw8sgam7qli.png" alt="Scalar Quantization example" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To set up Scalar Quantization in Qdrant, you need to include the &lt;code&gt;quantization_config&lt;/code&gt; section when creating or updating a collection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
    "vectors": {
      "size": 128,
      "distance": "Cosine"
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",
            "quantile": 0.99,
            "always_ram": true
        }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
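
&lt;p&gt;For reference, the same configuration can be expressed through the Python client. This is a sketch assuming the &lt;code&gt;qdrant-client&lt;/code&gt; package, a local instance, and a hypothetical collection name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

client.create_collection(
    collection_name="my_collection",  # hypothetical collection name
    vectors_config=models.VectorParams(size=128, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;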


&lt;p&gt;The &lt;code&gt;quantile&lt;/code&gt; parameter is used to calculate the quantization bounds. For example, if you specify a &lt;code&gt;0.99&lt;/code&gt; quantile, the most extreme 1% of values will be excluded from the quantization bounds.&lt;/p&gt;

&lt;p&gt;This parameter only affects the resulting precision, not the memory footprint. You can adjust it if you experience a significant decrease in search quality.&lt;/p&gt;

&lt;p&gt;Scalar Quantization is a great choice if you're looking for strong compression without losing much accuracy. It also slightly improves search speed, as distance calculations (such as dot product or cosine similarity) on &lt;code&gt;int8&lt;/code&gt; values are computationally simpler than on &lt;code&gt;float32&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;While the performance gains of Scalar Quantization may not match those achieved with Binary Quantization (which we'll discuss later), it remains an excellent default choice when Binary Quantization isn’t suitable for your use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. What is Binary Quantization?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkav5g7facj83ggyugpan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkav5g7facj83ggyugpan.png" alt="Astronaut in surreal white environment" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/guides/quantization/#binary-quantization" rel="noopener noreferrer"&gt;Binary Quantization&lt;/a&gt; is an excellent option if you're looking to &lt;strong&gt;reduce memory&lt;/strong&gt; usage while also achieving a significant &lt;strong&gt;boost in speed&lt;/strong&gt;. It works by converting high-dimensional vectors into simple binary (0 or 1) representations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Values greater than zero are converted to 1.&lt;/li&gt;
&lt;li&gt;Values less than or equal to zero are converted to 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's consider our initial example of a 1536-dimensional vector that requires &lt;strong&gt;6 KB&lt;/strong&gt; of memory (4 bytes for each &lt;code&gt;float32&lt;/code&gt; value).&lt;/p&gt;

&lt;p&gt;After Binary Quantization, each dimension is reduced to 1 bit (1/8 byte), so the memory required is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1536 dimensions ÷ 8 bits per byte = 192 bytes&lt;/code&gt;&lt;/p&gt;



&lt;p&gt;This leads to a &lt;strong&gt;32x&lt;/strong&gt; memory reduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxrq191v6pv8mqa21jzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxrq191v6pv8mqa21jzg.png" alt="Binary Quantization example" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant automates the Binary Quantization process during indexing. As vectors are added to your collection, each 32-bit floating-point component is converted into a binary value according to the configuration you define.&lt;/p&gt;

&lt;p&gt;Here’s how you can set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "binary": {
            "always_ram": true
        }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Binary Quantization is by far the quantization method that provides the most significant processing &lt;strong&gt;speed gains&lt;/strong&gt; compared to Scalar and Product Quantizations. This is because the binary representation allows the system to use highly optimized CPU instructions, such as &lt;a href="https://en.wikipedia.org/wiki/XOR_gate#:~:text=XOR%20represents%20the%20inequality%20function,the%20other%20but%20not%20both%22" rel="noopener noreferrer"&gt;XOR&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Hamming_weight" rel="noopener noreferrer"&gt;Popcount&lt;/a&gt;, for fast distance computations.&lt;/p&gt;

&lt;p&gt;It can speed up search operations by &lt;strong&gt;up to 40x&lt;/strong&gt;, depending on the dataset and hardware.&lt;/p&gt;
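
&lt;p&gt;To see why this is so fast, note that comparing two binary vectors reduces to an XOR followed by a popcount. A small NumPy illustration of the idea (Qdrant does this natively with optimized CPU instructions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(7)
a = np.packbits(rng.normal(size=1536) &gt; 0)  # two binary-quantized vectors
b = np.packbits(rng.normal(size=1536) &gt; 0)

# XOR marks the dimensions where the two vectors differ;
# counting the set bits (popcount) gives the Hamming distance.
hamming = int(np.unpackbits(np.bitwise_xor(a, b)).sum())
print(hamming)  # number of differing bits out of 1536
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;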

&lt;p&gt;Not all models are equally compatible with Binary Quantization: some may experience a greater loss in accuracy when quantized. We recommend using Binary Quantization with models that have &lt;strong&gt;at least 1024 dimensions&lt;/strong&gt; to minimize accuracy loss.&lt;/p&gt;

&lt;p&gt;The models that have shown the best compatibility with this method include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI text-embedding-ada-002&lt;/strong&gt; (1536 dimensions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cohere AI embed-english-v2.0&lt;/strong&gt; (4096 dimensions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models demonstrate minimal accuracy loss while still benefiting from substantial speed and memory gains.&lt;/p&gt;

&lt;p&gt;Even though Binary Quantization is incredibly fast and memory-efficient, the trade-offs are in &lt;strong&gt;precision&lt;/strong&gt; and &lt;strong&gt;model compatibility&lt;/strong&gt;, so you may need to ensure search quality using techniques like oversampling and rescoring.&lt;/p&gt;

&lt;p&gt;If you're interested in exploring Binary Quantization in more detail—including implementation examples, benchmark results, and usage recommendations—check out our dedicated article on &lt;a href="https://qdrant.tech/articles/binary-quantization/" rel="noopener noreferrer"&gt;Binary Quantization - Vector Search, 40x Faster&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. What is Product Quantization?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdefp3dfh6s565lpls7wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdefp3dfh6s565lpls7wx.png" alt="Astronaut with centroids" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/guides/quantization/#product-quantization" rel="noopener noreferrer"&gt;Product Quantization&lt;/a&gt; is a method used to compress high-dimensional vectors by representing them with a smaller set of representative points.&lt;/p&gt;

&lt;p&gt;The process begins by splitting the original high-dimensional vectors into smaller &lt;strong&gt;sub-vectors.&lt;/strong&gt; Each sub-vector represents a segment of the original vector, capturing different characteristics of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjy9jtvmlsn8pqr93xtwp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjy9jtvmlsn8pqr93xtwp.png" alt="Creation of the Sub-vector" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each sub-vector, a separate &lt;strong&gt;codebook&lt;/strong&gt; is created, representing regions in the data space where common patterns occur.&lt;/p&gt;

&lt;p&gt;The codebook in Qdrant is trained automatically during the indexing process. As vectors are added to the collection, Qdrant uses your specified quantization settings in the &lt;code&gt;quantization_config&lt;/code&gt; to build the codebook and quantize the vectors. Here’s how you can set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1024,
      "distance": "Cosine"
    },
    "quantization_config": {
        "product": {
            "compression": "x32",
            "always_ram": true
        }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each region in the codebook is defined by a &lt;strong&gt;centroid&lt;/strong&gt;, which serves as a representative point summarizing the characteristics of that region. Instead of treating every single data point as equally important, we can group similar sub-vectors together and represent them with a single centroid that captures the general characteristics of that group.&lt;/p&gt;

&lt;p&gt;The centroids used in Product Quantization are determined using the &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/K-means_clustering" rel="noopener noreferrer"&gt;K-means clustering algorithm&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sgobztmeyzsv31kd2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sgobztmeyzsv31kd2g.png" alt="Codebook and Centroids example" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant always selects &lt;strong&gt;K = 256&lt;/strong&gt; as the number of centroids in its implementation, based on the fact that 256 is the maximum number of unique values that can be represented by a single byte.&lt;/p&gt;

&lt;p&gt;This makes the compression process efficient because each centroid index can be stored in a single byte.&lt;/p&gt;

&lt;p&gt;The original high-dimensional vectors are quantized by mapping each sub-vector to the nearest centroid in its respective codebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zjbzxr5tr7xwnge3xu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zjbzxr5tr7xwnge3xu.png" alt="Vectors being mapped to their corresponding centroids example" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The compressed vector stores the index of the closest centroid for each sub-vector.&lt;/p&gt;

&lt;p&gt;Here’s how a 1024-dimensional vector, originally taking up 4096 bytes, is reduced to just 128 bytes by representing it as 128 indexes, each pointing to the centroid of a sub-vector:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp37escdizqs8srip6pr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp37escdizqs8srip6pr2.png" alt="Product Quantization example" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up quantization and adding your vectors, you can perform searches as usual. Qdrant will automatically use the quantized vectors, optimizing both speed and memory usage. Optionally, you can enable rescoring for better accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/{collection_name}/points/search
{
    "query": [0.22, -0.01, -0.98, 0.37],
    "params": {
        "quantization": {
            "rescore": true
        }
    },
    "limit": 10
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Product Quantization can significantly reduce memory usage, potentially offering up to &lt;strong&gt;64x&lt;/strong&gt; compression in certain configurations. However, it's important to note that this level of compression can lead to a noticeable drop in quality.&lt;/p&gt;

&lt;p&gt;If your application requires high precision or real-time performance, Product Quantization may not be the best choice. However, if &lt;strong&gt;memory savings&lt;/strong&gt; are critical and some accuracy loss is acceptable, it could still be an ideal solution.&lt;/p&gt;

&lt;p&gt;Here’s a comparison of speed, accuracy, and compression for all three methods, adapted from &lt;a href="https://qdrant.tech/documentation/guides/quantization/#how-to-choose-the-right-quantization-method" rel="noopener noreferrer"&gt;Qdrant's documentation&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization method&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scalar&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;up to x2&lt;/td&gt;
&lt;td&gt;x4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;x0.5&lt;/td&gt;
&lt;td&gt;up to x64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary&lt;/td&gt;
&lt;td&gt;0.95*&lt;/td&gt;
&lt;td&gt;up to x40&lt;/td&gt;
&lt;td&gt;x32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* - for compatible models&lt;/p&gt;

&lt;p&gt;For a more in-depth understanding of the benchmarks you can expect, check out our dedicated article on &lt;a href="https://qdrant.tech/articles/product-quantization/" rel="noopener noreferrer"&gt;Product Quantization in Vector Search&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rescoring, Oversampling, and Reranking
&lt;/h2&gt;

&lt;p&gt;When we use quantization methods like Scalar, Binary, or Product Quantization, we're compressing our vectors to save memory and improve performance. However, this compression removes some detail from the original vectors.&lt;/p&gt;

&lt;p&gt;This can slightly reduce the accuracy of our similarity searches because the quantized vectors are approximations of the original data. To mitigate this loss of accuracy, you can use &lt;strong&gt;oversampling&lt;/strong&gt; and &lt;strong&gt;rescoring&lt;/strong&gt;, which help improve the accuracy of the final search results.&lt;/p&gt;

&lt;p&gt;The original vectors are never deleted during this process, and you can easily switch between quantization methods or parameters by updating the collection configuration at any time.&lt;/p&gt;

&lt;p&gt;Here’s how the process works, step by step:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Initial Quantized Search
&lt;/h3&gt;

&lt;p&gt;When you perform a search, Qdrant retrieves the top candidates using the quantized vectors, ranked by their similarity to the query vector. This step is fast because the quantized vectors are much smaller and cheaper to compare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzu1flsaiy7pmwwqgyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzu1flsaiy7pmwwqgyt.png" alt="ANN Search with Quantization" width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Oversampling
&lt;/h3&gt;

&lt;p&gt;Oversampling is a technique that helps compensate for any precision lost due to quantization. Since quantization simplifies vectors, some relevant matches could be missed in the initial search. To avoid this, you can &lt;strong&gt;retrieve more candidates&lt;/strong&gt;, increasing the chances that the most relevant vectors make it into the final results.&lt;/p&gt;

&lt;p&gt;You can control the number of extra candidates by setting an &lt;code&gt;oversampling&lt;/code&gt; parameter. For example, if your desired number of results (&lt;code&gt;limit&lt;/code&gt;) is 4 and you set an &lt;code&gt;oversampling&lt;/code&gt; factor of 2, Qdrant will retrieve 8 candidates (4 × 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv52syp3swcfq5m2f9v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv52syp3swcfq5m2f9v7.png" alt="ANN Search with Quantization and Oversampling" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can adjust the oversampling factor to control how many extra vectors Qdrant includes in the initial pool. More candidates mean a better chance of obtaining high-quality top-K results, especially after rescoring with the original vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rescoring with Original Vectors
&lt;/h3&gt;

&lt;p&gt;After oversampling to gather more potential matches, each candidate is re-evaluated to ensure higher accuracy and relevance to the query.&lt;/p&gt;

&lt;p&gt;The rescoring process &lt;strong&gt;maps&lt;/strong&gt; each quantized candidate back to its original vector and recomputes the similarity score at full precision, recovering detail that was lost during quantization and leading to more accurate results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xofzja0fqb4xutgqqaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xofzja0fqb4xutgqqaz.png" alt="Rescoring with Original Vectors" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During rescoring, one of the lower-ranked candidates from oversampling might turn out to be a better match than some of the original top-K candidates.&lt;/p&gt;

&lt;p&gt;Even though rescoring uses the original, larger vectors, the process remains much faster because only a very small number of vectors are read. The initial quantized search already identifies the specific vectors to read, rescore, and rerank.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Reranking
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reranking&lt;/strong&gt; is where the final top-K candidates are determined based on the updated similarity scores from rescoring.&lt;/p&gt;

&lt;p&gt;For example, in our case with a limit of 4, a candidate that ranked 6th in the initial quantized search might improve its score after rescoring because the original vectors capture detail that quantization discarded. As a result, this candidate could move into the final top 4 after reranking, replacing a less relevant option from the initial search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff45cru5len2fprc22rx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff45cru5len2fprc22rx9.png" alt="Reranking with Original Vectors" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how you can set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/{collection_name}/points/search
{
  "query": [0.22, -0.01, -0.98, 0.37],
  "params": {
    "quantization": {
      "rescore": true,
      "oversampling": 2
    }
  },
  "limit": 4
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
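
&lt;p&gt;The equivalent request through the Python client might look like this. A sketch assuming the &lt;code&gt;qdrant-client&lt;/code&gt; package, a local instance, and a hypothetical collection name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

results = client.query_points(
    collection_name="my_collection",  # hypothetical collection name
    query=[0.22, -0.01, -0.98, 0.37],
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,      # re-evaluate candidates with the original vectors
            oversampling=2.0,  # fetch limit x 2 candidates before rescoring
        )
    ),
    limit=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;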



&lt;p&gt;You can adjust the &lt;code&gt;oversampling&lt;/code&gt; factor to find the right balance between search speed and result accuracy.&lt;/p&gt;

&lt;p&gt;If quantization is hurting result quality in an application that requires high accuracy, combining oversampling with rescoring is a great choice. However, if you need faster searches and can tolerate some loss in accuracy, you might use oversampling without rescoring, or lower the oversampling factor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributing Resources Between Disk &amp;amp; Memory
&lt;/h2&gt;

&lt;p&gt;Qdrant stores both the quantized and the original vectors. When you enable quantization, both are kept in RAM by default. You can move the original vectors to disk to significantly reduce RAM usage and lower system costs. Simply enabling quantization is not enough: you need to explicitly move the original vectors to disk by setting &lt;code&gt;on_disk=True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s an example configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine",
    "on_disk": true  # Move original vectors to disk
  },
  "quantization_config": {
    "binary": {
      "always_ram": true  # Store only quantized vectors in RAM
    }
  }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
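
&lt;p&gt;Through the Python client, the same split might be configured like this. A sketch assuming &lt;code&gt;qdrant-client&lt;/code&gt; and the same hypothetical names as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

client.create_collection(
    collection_name="my_collection",  # hypothetical collection name
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
        on_disk=True,  # original vectors live on disk
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),  # quantized vectors stay in RAM
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;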



&lt;p&gt;Without explicitly setting &lt;code&gt;on_disk=True&lt;/code&gt;, you won't see any RAM savings, even with quantization enabled. So, make sure to configure both storage and quantization options based on your memory and performance needs. If your storage has high disk latency, consider disabling rescoring to maintain speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speeding Up Rescoring with io_uring
&lt;/h3&gt;

&lt;p&gt;When dealing with large collections of quantized vectors, rescoring requires frequent disk reads to retrieve the original data. While &lt;code&gt;mmap&lt;/code&gt; helps with efficient I/O by reducing user-to-kernel transitions, these reads can still slow rescoring down when working with large datasets on disk.&lt;/p&gt;

&lt;p&gt;On Linux-based systems, &lt;code&gt;io_uring&lt;/code&gt; allows multiple disk operations to be processed in parallel, significantly reducing I/O overhead. This optimization is particularly effective during rescoring, where multiple vectors need to be re-evaluated after the initial search. With io_uring, Qdrant can retrieve and rescore vectors from disk in the most efficient way, improving overall search performance.&lt;/p&gt;

&lt;p&gt;When you perform vector quantization and store data on disk, Qdrant often needs to access multiple vectors in parallel. Without io_uring, this process can be slowed down due to the system’s limitations in handling many disk accesses.&lt;/p&gt;

&lt;p&gt;To enable &lt;code&gt;io_uring&lt;/code&gt; in Qdrant, add the following to your storage configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;async_scorer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Enable io_uring for async storage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this configuration, Qdrant will default to using &lt;code&gt;mmap&lt;/code&gt; for disk I/O operations.&lt;/p&gt;

&lt;p&gt;For more information and benchmarks comparing io_uring with traditional I/O approaches like mmap, check out &lt;a href="https://qdrant.tech/articles/io_uring/" rel="noopener noreferrer"&gt;Qdrant's io_uring implementation article.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance of Quantized vs. Non-Quantized Data
&lt;/h2&gt;

&lt;p&gt;Qdrant uses the quantized vectors by default if they are available. If you want to evaluate how quantization affects your search results, you can temporarily disable it to compare results from quantized and non-quantized searches. To do this, set &lt;code&gt;ignore: true&lt;/code&gt; in the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/{collection_name}/points/query
{
    "query": [0.22, -0.01, -0.98, 0.37],
    "params": {
        "quantization": {
            "ignore": true,
        }
    },
    "limit": 4
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
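
&lt;p&gt;In the Python client, the same flag is &lt;code&gt;QuantizationSearchParams(ignore=True)&lt;/code&gt;. A sketch under the same assumptions as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

results = client.query_points(
    collection_name="my_collection",  # hypothetical collection name
    query=[0.22, -0.01, -0.98, 0.37],
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(ignore=True)  # bypass quantized vectors
    ),
    limit=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;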



&lt;h3&gt;
  
  
  Switching Between Quantization Methods
&lt;/h3&gt;

&lt;p&gt;Not sure if you’ve chosen the right quantization method? In Qdrant, you have the flexibility to remove quantization and rely solely on the original vectors, adjust the quantization type, or change compression parameters at any time without affecting your original vectors.&lt;/p&gt;

&lt;p&gt;To switch to Binary Quantization, for example, you can update the collection’s quantization configuration (the &lt;code&gt;update_collection&lt;/code&gt; method in the Python client). Note that Binary Quantization has a fixed compression ratio, so there is no compression rate to tune:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "quantization_config": {
    "binary": {
      "always_ram": true,
      "compression_rate": 0.8  # Set the new compression rate
    }
  }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you decide to &lt;strong&gt;turn off quantization&lt;/strong&gt; and use only the original vectors, you can remove the quantization settings entirely with &lt;code&gt;quantization_config=None&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/my_collection
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "quantization_config": null  # Remove quantization and use original vectors only
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b9hbwu41bmnt0141tu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b9hbwu41bmnt0141tu1.png" alt="Astronaut leaving" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quantization methods like Scalar, Product, and Binary Quantization offer powerful ways to optimize memory usage and improve search performance when dealing with large datasets of high-dimensional vectors. Each method comes with its own trade-offs between memory savings, computational speed, and accuracy.&lt;/p&gt;

&lt;p&gt;Here are some final thoughts to help you choose the right quantization method for your needs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Quantization Method&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;When to Use&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Binary Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;• &lt;strong&gt;Fastest method and most memory-efficient&lt;/strong&gt;&lt;br&gt;•  Up to &lt;strong&gt;40x&lt;/strong&gt; faster search and &lt;strong&gt;32x&lt;/strong&gt; reduced memory footprint&lt;/td&gt;
&lt;td&gt;• Use with tested models like OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt; and Cohere's &lt;code&gt;embed-english-v2.0&lt;/code&gt;&lt;br&gt;• When speed and memory efficiency are critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalar Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;• &lt;strong&gt;Minimal loss of accuracy&lt;/strong&gt;&lt;br&gt;•  Up to &lt;strong&gt;4x&lt;/strong&gt; reduced memory footprint&lt;/td&gt;
&lt;td&gt;• Safe default choice for most applications.&lt;br&gt;• Offers a good balance between accuracy, speed, and compression.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;• &lt;strong&gt;Highest compression ratio&lt;/strong&gt;&lt;br&gt;• Up to &lt;strong&gt;64x&lt;/strong&gt; reduced memory footprint&lt;/td&gt;
&lt;td&gt;• When minimizing memory usage is the top priority&lt;br&gt;• Acceptable if some loss of accuracy and slower indexing is tolerable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Learn More
&lt;/h3&gt;

&lt;p&gt;If you want to learn more about improving accuracy, memory efficiency, and speed when using quantization in Qdrant, we have a dedicated &lt;a href="https://qdrant.tech/documentation/guides/quantization/#quantization-tips" rel="noopener noreferrer"&gt;Quantization tips&lt;/a&gt; section in our docs that covers the techniques you can use to enhance your results.&lt;/p&gt;

&lt;p&gt;Learn more about optimizing real-time precision with oversampling in Binary Quantization by watching this interview with Qdrant’s CTO, Andrey Vasnetsov:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/4aUq5VnR_VI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Stay up-to-date on the latest in &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt; and quantization, share your projects, ask questions, and &lt;a href="https://discord.com/invite/qdrant" rel="noopener noreferrer"&gt;join our vector search community&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Complete Guide to Filtering in Vector Search</title>
      <dc:creator>Sabrina</dc:creator>
      <pubDate>Thu, 12 Sep 2024 14:45:10 +0000</pubDate>
      <link>https://dev.to/qdrant/a-complete-guide-to-filtering-in-vector-search-33lk</link>
      <guid>https://dev.to/qdrant/a-complete-guide-to-filtering-in-vector-search-33lk</guid>
      <description>&lt;p&gt;Imagine you sell computer hardware. To help shoppers easily find products on your website, you need to have a &lt;strong&gt;user-friendly &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;search engine&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0vwfd8cpy0698f5mhq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0vwfd8cpy0698f5mhq6.png" alt="vector-search-ecommerce" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re selling computers and have extensive data on laptops, desktops, and accessories, your search feature should guide customers to the exact device they want - or a &lt;strong&gt;very similar&lt;/strong&gt; match if the exact one isn't available.&lt;/p&gt;

&lt;p&gt;When storing data in Qdrant, each product is a point, consisting of an &lt;code&gt;id&lt;/code&gt;, a &lt;code&gt;vector&lt;/code&gt; and &lt;code&gt;payload&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"vector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;899.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; is a unique identifier for the point in your collection. The &lt;code&gt;vector&lt;/code&gt; is a numerical representation of the item, used to measure its similarity to other points in the collection. &lt;br&gt;
Finally, the &lt;code&gt;payload&lt;/code&gt; holds metadata that directly describes the point.&lt;/p&gt;

&lt;p&gt;Though we may not be able to decipher the vector, we are able to derive additional information about the item from its metadata. In this specific case, &lt;strong&gt;we are looking at a data point for a laptop that costs $899.99&lt;/strong&gt;. &lt;/p&gt;
&lt;h2&gt;
  
  
  What is filtering?
&lt;/h2&gt;

&lt;p&gt;When searching for the perfect computer, your customers may end up with results that are mathematically similar to the search entry, but not exact. For example, if they are searching for &lt;strong&gt;laptops under $1000&lt;/strong&gt;, a simple &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt; without constraints might still show other laptops over $1000. &lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://dev.to/advanced-search/"&gt;semantic search&lt;/a&gt; alone &lt;strong&gt;may not be enough&lt;/strong&gt;. In order to get the exact result, you would need to enforce a payload filter on the &lt;code&gt;price&lt;/code&gt;. Only then can you be sure that the search results abide by the chosen characteristic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is called &lt;strong&gt;filtering&lt;/strong&gt; and it is one of the key features of &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt;. &lt;br&gt;
Here is how a &lt;strong&gt;filtered vector search&lt;/strong&gt; looks behind the scenes. We'll cover its mechanics in the following section.&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/online_store/points/search
{
  "vector": [ 0.2, 0.1, 0.9, 0.7 ],
  "filter": {
    "must": [
      {
        "key": "category",
        "match": { "value": "laptop" }
      },
      {
        "key": "price",
        "range": {
          "gt": null,
          "gte": null,
          "lt": null,
          "lte": 1000
        }
      }
    ]
  },
  "limit": 3,
  "with_payload": true,
  "with_vector": false
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The filtered result will be a combination of the semantic search and the filtering conditions imposed upon the query. In the following pages, we will show that &lt;strong&gt;filtering is a key practice in vector search for two reasons:&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;With filtering, you can &lt;strong&gt;dramatically increase search precision&lt;/strong&gt;. More on this in the next section.&lt;/li&gt;
&lt;li&gt;Filtering helps control resources and &lt;strong&gt;reduce compute use&lt;/strong&gt;. More on this in &lt;strong&gt;Payload Indexing&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What you will learn in this guide:
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt;, filtering and sorting are more interdependent than they are in traditional databases. While SQL databases use commands such as &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt;, the interplay between these processes in vector search is a bit more complex.&lt;/p&gt;

&lt;p&gt;Most people use default settings and build vector search apps that aren't properly configured or even set up for precise retrieval. In this guide, we will show you how to &lt;strong&gt;use filtering to get the most out of vector search&lt;/strong&gt; with some basic and advanced strategies that are easy to implement. &lt;/p&gt;

&lt;h3&gt;
  
  
  Remember to run all tutorial code in Qdrant's Dashboard
&lt;/h3&gt;

&lt;p&gt;The easiest way to reach that "Hello World" moment is to &lt;a href="https://dev.to/documentation/quickstart-cloud/"&gt;&lt;strong&gt;try filtering in a live cluster&lt;/strong&gt;&lt;/a&gt;. Our interactive tutorial will show you how to create a cluster, add data and try some filtering clauses. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyvsxdq1qtak3kxt45y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyvsxdq1qtak3kxt45y6.png" alt="qdrant-filtering-tutorial" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Qdrant's approach to filtering
&lt;/h2&gt;

&lt;p&gt;Qdrant follows a specific method of searching and filtering through dense vectors. &lt;/p&gt;

&lt;p&gt;Let's take a look at this &lt;strong&gt;3-stage diagram&lt;/strong&gt;. In this case, we are trying to find the nearest neighbor to the query vector &lt;strong&gt;(green)&lt;/strong&gt;. Your search journey starts at the bottom &lt;strong&gt;(orange)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By default, Qdrant connects all your data points within the &lt;a href="https://dev.to/documentation/concepts/indexing/"&gt;&lt;strong&gt;vector index&lt;/strong&gt;&lt;/a&gt;. After you &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;&lt;strong&gt;introduce filters&lt;/strong&gt;&lt;/a&gt;, some data points become disconnected. Vector search can't cross the grayed out area and it won't reach the nearest neighbor. &lt;br&gt;
How can we bridge this gap? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; How Qdrant maintains a filterable vector index. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw5ev6l3u7bzp8861sfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw5ev6l3u7bzp8861sfp.png" alt="filterable-vector-index" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/"&gt;&lt;strong&gt;Filterable vector index&lt;/strong&gt;&lt;/a&gt;: This technique builds additional links &lt;strong&gt;(orange)&lt;/strong&gt; between leftover data points. The filtered points which stay behind are now traversible once again. Qdrant uses special category-based methods to connect these data points. &lt;/p&gt;
&lt;h2&gt;
  
  
  Qdrant's approach vs traditional filtering methods
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tgwitznyx39lxze4ad1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tgwitznyx39lxze4ad1.png" alt="stepping-lens" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The filterable vector index is Qdrant's solution to the pre- and post-filtering problems: it adds specialized links to the search graph. It aims to maintain the speed advantages of vector search while allowing for precise filtering, addressing the inefficiencies that can occur when applying filters after the vector search.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pre-filtering
&lt;/h3&gt;

&lt;p&gt;In pre-filtering, a search engine first narrows down the dataset based on chosen metadata values, and then searches within that filtered subset. This reduces unnecessary computation over a dataset that is potentially much larger.&lt;/p&gt;

&lt;p&gt;The choice between pre-filtering and using the filterable HNSW index depends on filter cardinality. When metadata cardinality is too low, the filter becomes very restrictive and can disrupt the connections within the graph. This leads to fragmented search paths (as in &lt;strong&gt;Figure 1&lt;/strong&gt;), and when the semantic search process begins, it won’t be able to reach the disconnected regions. &lt;/p&gt;

&lt;p&gt;However, Qdrant still benefits from pre-filtering &lt;strong&gt;under certain conditions&lt;/strong&gt;. In cases of low cardinality, Qdrant's query planner stops using HNSW and switches over to the payload index alone. This makes the search process much cheaper and faster than if using HNSW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; On the user side, this is how filtering looks. We start with five products with different prices. First, the $1000 price &lt;strong&gt;filter&lt;/strong&gt; is applied, narrowing down the selection of laptops. Then, a vector search finds the relevant &lt;strong&gt;results&lt;/strong&gt; within this filtered set. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxomreb49ere8ow2n5d6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxomreb49ere8ow2n5d6g.png" alt="pre-filtering-vector-search" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In conclusion, pre-filtering is efficient in specific cases: small datasets with low-cardinality metadata. However, pre-filtering should not be used over large datasets, as it breaks too many links in the HNSW graph and lowers accuracy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Post-filtering
&lt;/h3&gt;

&lt;p&gt;In post-filtering, a search engine first looks for similar vectors and retrieves a larger set of results. Then, it applies filters to those results based on metadata. The problem with post-filtering becomes apparent when using low-cardinality filters. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When you apply a low-cardinality filter after performing a vector search, you often end up discarding a large portion of the results that the vector search returned.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Figure 3:&lt;/strong&gt; In the same example, we have five laptops. First, the vector search finds the top two relevant &lt;strong&gt;results&lt;/strong&gt;, but they may not match the price filter. When the $1000 price &lt;strong&gt;filter&lt;/strong&gt; is applied, other potential results are discarded.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqofpcv9w7q0exrq0cuot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqofpcv9w7q0exrq0cuot.png" alt="post-filtering-vector-search" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system will waste computational resources by first finding similar vectors and then discarding many that don't meet the filter criteria. You're also limited to filtering only from the initial set of &lt;a href="https://dev.to/advanced-search/"&gt;vector search&lt;/a&gt; results. If your desired items aren't in this initial set, you won't find them, even if they exist in the database.&lt;/p&gt;
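
&lt;p&gt;To make the failure mode concrete, here is what post-filtering amounts to if done by hand on the client side. This is a sketch, not a recommendation; &lt;code&gt;query_points&lt;/code&gt; assumes a recent qdrant-client version, and the collection name and query vector are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Post-filtering by hand: search first, filter afterwards.
hits = client.query_points(
    collection_name="online_store",
    query=[0.1, 0.2, 0.3, 0.4],
    limit=2,  # only the top 2 results are ever seen
).points

# Discard anything over budget. If neither of the top 2 hits is cheap
# enough, we return nothing, even though cheaper laptops exist.
affordable = [hit for hit in hits if hit.payload["price"] &amp;lt;= 1000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;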
&lt;h2&gt;
  
  
  Basic filtering example: ecommerce and laptops
&lt;/h2&gt;

&lt;p&gt;We know that there are three possible laptops that suit our price point. &lt;br&gt;
Let's see how Qdrant's filterable vector index works and why it is the best method of capturing all available results.  &lt;/p&gt;

&lt;p&gt;First, add five new laptops to your online store. Here is a sample input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;laptops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;899.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1299.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;799.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1099.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;949.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four-dimensional vector can represent features like laptop CPU, RAM or battery life, but that isn’t specified. The payload, however, specifies the exact price and product category.&lt;/p&gt;
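
&lt;p&gt;To follow along, you could load these points with the Python client. This is a minimal sketch; the collection name &lt;code&gt;online_store&lt;/code&gt;, the vector size, and the distance metric are assumptions of this example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# A collection for 4-dimensional vectors, matching the sample data above.
client.create_collection(
    collection_name="online_store",
    vectors_config=models.VectorParams(size=4, distance=models.Distance.COSINE),
)

# Each (id, vector, payload) tuple from the laptops list becomes one point.
client.upsert(
    collection_name="online_store",
    points=[
        models.PointStruct(id=point_id, vector=vector, payload=payload)
        for point_id, vector, payload in laptops
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;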

&lt;p&gt;Now, set the filter to "price is less than $1000":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
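
&lt;p&gt;Attached to a vector query with the Python client, the same filter looks roughly like this (a sketch: the query vector is arbitrary, and &lt;code&gt;query_points&lt;/code&gt; assumes a recent qdrant-client version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

results = client.query_points(
    collection_name="online_store",
    query=[0.2, 0.3, 0.4, 0.5],  # an arbitrary query vector
    query_filter=models.Filter(
        must=[
            # Only points with a price of $1000 or less are traversed.
            models.FieldCondition(key="price", range=models.Range(lte=1000)),
        ]
    ),
    limit=3,
).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;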



&lt;p&gt;When the filter for prices of $1000 or less is applied, the vector search returns the following results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9978443564622781&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;799.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9938079894227599&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;899.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9903751498208603&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;949.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laptop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, Qdrant's filterable index captures all three laptops under $1000, rather than only those that happened to rank in an unfiltered top-k. &lt;/p&gt;

&lt;p&gt;This specific example uses the &lt;code&gt;range&lt;/code&gt; condition for filtering. Qdrant, however, offers many other possible ways to structure a filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For detailed usage examples, &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;filtering&lt;/a&gt; docs are the best resource.&lt;/strong&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Scrolling instead of searching
&lt;/h2&gt;

&lt;p&gt;You don't need to use our &lt;code&gt;search&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt; APIs to filter through data. The &lt;code&gt;scroll&lt;/code&gt; API is another option that lets you retrieve lists of points matching a filter.&lt;/p&gt;

&lt;p&gt;If you aren't interested in finding similar points, you can simply list the ones that match a given filter. While search gives you the points most similar to a query vector, scroll gives you all points matching your filter, with no similarity ranking involved. &lt;/p&gt;

&lt;p&gt;In Qdrant, scrolling is used to iteratively &lt;strong&gt;retrieve large sets of points from a collection&lt;/strong&gt;. It is particularly useful when you’re dealing with a large number of points and don’t want to load them all at once. Instead, Qdrant provides a way to scroll through the points &lt;strong&gt;one page at a time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You start by sending a scroll request to Qdrant with optional conditions, such as a payload filter, a page size, and an ordering key.&lt;/p&gt;

&lt;p&gt;Let's retrieve the top 10 laptops in the store, ordered by price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/online_store/points/scroll
{
    "filter": {
        "must": [
            {
                "key": "category",
                "match": {
                    "value": "laptop"
                }
            }
        ]
    },
    "limit": 10,
    "with_payload": true,
    "with_vector": false,
    "order_by": [
        {
            "key": "price",
        }
    ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
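
&lt;p&gt;The equivalent call with the Python client might look like this (a sketch; &lt;code&gt;order_by&lt;/code&gt; requires a payload index on &lt;code&gt;price&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

points, next_offset = client.scroll(
    collection_name="online_store",
    scroll_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="category",
                match=models.MatchValue(value="laptop"),
            )
        ]
    ),
    limit=10,
    with_payload=True,
    with_vectors=False,
    order_by=models.OrderBy(key="price"),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;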



&lt;p&gt;The response contains a batch of points that match the criteria and a reference (offset or next page token) to retrieve the next set of points.&lt;/p&gt;
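
&lt;p&gt;In Python, that pagination loop might look like the following sketch: keep passing the returned offset back in until it comes back as &lt;code&gt;None&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

all_laptops = []
offset = None

while True:
    page, offset = client.scroll(
        collection_name="online_store",
        scroll_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="category",
                    match=models.MatchValue(value="laptop"),
                )
            ]
        ),
        limit=100,      # page size
        offset=offset,  # resume where the previous page ended
        with_payload=True,
    )
    all_laptops.extend(page)
    if offset is None:  # no next page token means we are done
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;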

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://dev.to/documentation/concepts/points/#scroll-points"&gt;&lt;strong&gt;Scrolling&lt;/strong&gt;&lt;/a&gt; is designed to be efficient. It minimizes the load on the server and reduces memory consumption on the client side by returning only manageable chunks of data at a time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Available filtering conditions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Condition&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Condition&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exact value match.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Range&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by value range.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Match Any&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match multiple values.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Datetime Range&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by date range.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Match Except&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exclude specific values.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;UUID Match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match UUID payload values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nested Key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by nested data.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Geo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by location.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nested Object&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by nested objects.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Values Count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by element count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Text Match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search in text fields.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Is Empty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter empty fields.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Has ID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter by point ID.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Is Null&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filter null values.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
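
&lt;p&gt;As a small illustration, here is a filter combining two of these conditions with the Python client. The field names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

conditions_filter = models.Filter(
    must=[
        # Match Any: the brand is one of several accepted values.
        models.FieldCondition(
            key="brand",
            match=models.MatchAny(any=["apple", "lenovo"]),
        ),
        # Range: price between $500 and $1000, inclusive.
        models.FieldCondition(
            key="price",
            range=models.Range(gte=500, lte=1000),
        ),
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;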

&lt;blockquote&gt;
&lt;p&gt;All clauses and conditions are outlined in Qdrant's &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;filtering&lt;/a&gt; documentation. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Filtering clauses to remember
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Clause&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Clause&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Must&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Includes items that meet the condition  (similar to &lt;code&gt;AND&lt;/code&gt;).&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Should&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filters if at least one condition is met  (similar to &lt;code&gt;OR&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Must Not&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excludes items that meet the condition  (similar to &lt;code&gt;NOT&lt;/code&gt;).&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Clauses Combination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combines multiple clauses to refine filtering  (similar to &lt;code&gt;AND&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
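
&lt;p&gt;And here is a sketch of the clauses working together in a single filter, again with hypothetical payload fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

# must: a laptop priced at $1000 or less (both conditions required);
# should: at least one of the preferred brands;
# must_not: refurbished items are excluded.
combined_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="category", match=models.MatchValue(value="laptop")
        ),
        models.FieldCondition(key="price", range=models.Range(lte=1000)),
    ],
    should=[
        models.FieldCondition(key="brand", match=models.MatchValue(value="apple")),
        models.FieldCondition(key="brand", match=models.MatchValue(value="lenovo")),
    ],
    must_not=[
        models.FieldCondition(
            key="condition", match=models.MatchValue(value="refurbished")
        ),
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;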

&lt;h2&gt;
  
  
  Advanced filtering example: dinosaur diets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wzc3g0uqu3rylr714u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wzc3g0uqu3rylr714u.png" alt="advanced-payload-filtering" width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also use nested filtering to query arrays of objects within the payload. In this example, we have two points. Each represents a dinosaur with a list of food preferences (diet) indicating which foods it likes or dislikes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dinosaur"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t-rex"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"diet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"leaves"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dinosaur"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"diplodocus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"diet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"leaves"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose we want the dinosaurs that actually like meat, i.e., food = meat and likes = true within the same diet item. A first attempt is to filter on the array fields directly, using the &lt;code&gt;diet[].food&lt;/code&gt; and &lt;code&gt;diet[].likes&lt;/code&gt; keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/dinosaurs/points/scroll
{
    "filter": {
        "must": [
            {
                "key": "diet[].food",
                  "match": {
                    "value": "meat"
                }
            },
            {
                "key": "diet[].likes",
                  "match": {
                    "value": true
                }
            }
        ]
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dinosaurs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scroll_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet[].food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet[].likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dinosaurs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diet[].food&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;meat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diet[].likes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="nf"&gt;.scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.List&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.QdrantClient&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.QdrantGrpcClient&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.Filter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;QdrantClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;QdrantClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;QdrantGrpcClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6334&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrollAsync&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCollectionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAllMust&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].food"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].likes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Qdrant.Client&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conditions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;6334&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ScrollAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;MatchKeyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet[].likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query, however, returns both points, because each condition is checked independently against any element of the array:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the "t-rex" matches food=meat on &lt;code&gt;diet[1].food&lt;/code&gt; and likes=true on &lt;code&gt;diet[1].likes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;the "diplodocus" matches food=meat on &lt;code&gt;diet[1].food&lt;/code&gt; and likes=true on &lt;code&gt;diet[0].likes&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To retrieve only the points where both conditions hold for the same array element (here, only the point with id 1), you need to use a nested object filter.&lt;/p&gt;

&lt;p&gt;Nested object filters enable querying arrays of objects independently, ensuring conditions are checked within individual array elements.&lt;/p&gt;

&lt;p&gt;This is done by using the &lt;code&gt;nested&lt;/code&gt; condition type, which consists of a payload key that targets an array and a filter to apply. The key should reference an array of objects and can be written with or without bracket notation (e.g., "data" or "data[]").&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /collections/dinosaurs/points/scroll
{
    "filter": {
        "must": [{
            "nested": {
                "key": "diet",
                "filter":{
                    "must": [
                        {
                            "key": "food",
                            "match": {
                                "value": "meat"
                            }
                        },
                        {
                            "key": "likes",
                            "match": {
                                "value": true
                            }
                        }
                    ]
                }
            }
        }]
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dinosaurs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scroll_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NestedCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Nested&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dinosaurs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;nested&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;must&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
              &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;food&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;meat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;likes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NestedCondition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="nf"&gt;.scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;ScrollPointsBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;NestedCondition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"diet"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                &lt;span class="nn"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;])),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;()])),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.List&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;qdrant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ConditionFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.Filter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.qdrant.client.grpc.Points.ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrollAsync&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;ScrollPoints&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCollectionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addMust&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                            &lt;span class="s"&gt;"diet"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                            &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAllMust&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                                    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                                        &lt;span class="n"&gt;matchKeyword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"food"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"likes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Qdrant.Client&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conditions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;6334&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ScrollAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dinosaurs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Nested&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"diet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MatchKeyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"food"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"meat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The matching logic is adjusted to operate at the level of individual elements within an array in the payload.&lt;/p&gt;

&lt;p&gt;Nested filters work as though each element of the array were evaluated separately. The parent document is considered a match if at least one array element satisfies all of the nested filter conditions.&lt;/p&gt;
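
&lt;p&gt;For instance, a point with the following payload (an illustrative example reusing the field names from the snippets above) matches the filter, because its first &lt;code&gt;diet&lt;/code&gt; element satisfies both conditions at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "diet": [
    { "food": "meat", "likes": true },
    { "food": "plants", "likes": false }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A point whose array contained only &lt;code&gt;{ "food": "meat", "likes": false }&lt;/code&gt; and &lt;code&gt;{ "food": "plants", "likes": true }&lt;/code&gt; would not match: no single element satisfies both conditions together.&lt;/p&gt;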

&lt;h2&gt;
  
  
  Other creative uses for filters
&lt;/h2&gt;

&lt;p&gt;You can use filters to retrieve data points without knowing their &lt;code&gt;id&lt;/code&gt;. You can search through data and manage it solely by using filters. Let's take a look at some creative uses for filters (a Python sketch of two of them follows the table):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#delete-points"&gt;Delete Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Deletes all points matching the filter.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/payload/#set-payload"&gt;Set Payload&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Adds payload fields to all points matching the filter.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#scroll-points"&gt;Scroll Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Lists all points matching the filter.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/payload/#overwrite-payload"&gt;Update Payload&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Updates payload fields for points matching the filter.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#order-points-by-payload-key"&gt;Order Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Lists points matching the filter, sorted by a payload key.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/payload/#delete-payload-keys"&gt;Delete Payload&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Deletes fields for points matching the filter.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/points/#counting-points"&gt;Count Points&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Totals the points matching the filter.&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
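
&lt;p&gt;As a quick sketch of two of these operations with the Python client (the collection and field names are illustrative), counting and deleting by filter could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# A reusable filter: every point whose "category" is "laptop"
laptop_filter = models.Filter(
    must=[models.FieldCondition(key="category", match=models.MatchValue(value="laptop"))]
)

# Count the matching points without retrieving them
total = client.count(collection_name="computers", count_filter=laptop_filter)
print(total.count)

# Delete every matching point - no ids required
client.delete(
    collection_name="computers",
    points_selector=models.FilterSelector(filter=laptop_filter),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;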

&lt;h2&gt;
  
  
  Filtering with the payload index
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnetjc13aqk0wuoqxh06e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnetjc13aqk0wuoqxh06e.png" alt="vector-search-filtering-vector-search" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you start working with Qdrant, your data is organized in a vector index by default. &lt;br&gt;
In addition to this, we recommend adding a secondary data structure - &lt;strong&gt;the payload index&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Just as the vector index organizes vectors, the payload index structures your metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; The payload index is an additional data structure that supports vector search. A payload index (in green) organizes candidate results by cardinality, so that semantic search (in red) can traverse the vector index quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur2cj4wfc4ja57jufln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur2cj4wfc4ja57jufln.png" alt="payload-index-vector-search" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On its own, semantic searching over terabytes of data can take up lots of RAM. &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://dev.to/documentation/concepts/indexing/"&gt;&lt;strong&gt;Indexing&lt;/strong&gt;&lt;/a&gt; are two easy strategies to reduce your compute usage and still get the best results. Remember, this is only a guide. For an exhaustive list of filtering options, you should read the &lt;a href="https://dev.to/documentation/concepts/filtering/"&gt;filtering documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Here is how you can create a single index for a metadata field "category":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/computers/index
{
    "field_name": "category",
    "field_schema": "keyword"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_payload_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;field_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;field_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you mark a field as indexable, &lt;strong&gt;you don't need to do anything else&lt;/strong&gt;. Qdrant will handle all optimizations in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why should you index metadata?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhez94br0kvkjiz02m74u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhez94br0kvkjiz02m74u.png" alt="payload-index-filtering" width="800" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The payload index acts as a secondary data structure that speeds up retrieval. Whenever you run vector search with a filter, Qdrant will consult a payload index - if there is one. &lt;/p&gt;

&lt;p&gt;If you are indexing your metadata, the difference in search performance can be dramatic. &lt;/p&gt;

&lt;p&gt;As your dataset grows, Qdrant needs additional resources to go through all data points. Without a proper data structure, searches take longer - or run out of resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Payload indexing helps evaluate the most restrictive filters
&lt;/h3&gt;

&lt;p&gt;The payload index is also used to accurately estimate &lt;strong&gt;filter cardinality&lt;/strong&gt;, which helps the query planner choose a search strategy. &lt;strong&gt;Filter cardinality&lt;/strong&gt; refers to the number of points that match a filter within a dataset. Qdrant's search strategy can switch from &lt;strong&gt;HNSW search&lt;/strong&gt; to &lt;strong&gt;payload index-based search&lt;/strong&gt; if the cardinality is too low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it affects your queries:&lt;/strong&gt; Depending on the filter used in the search, there are several possible scenarios for query execution. Qdrant chooses among them based on the available indexes, the complexity of the conditions, and the cardinality of the filtered result. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The planner estimates the cardinality of the filtered result before selecting a strategy.&lt;/li&gt;
&lt;li&gt;Qdrant retrieves points using the &lt;strong&gt;payload index&lt;/strong&gt; if the cardinality is below a threshold.&lt;/li&gt;
&lt;li&gt;Qdrant uses the &lt;strong&gt;filterable vector index&lt;/strong&gt; if the cardinality is above the threshold.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What happens if you don't use payload indexes?
&lt;/h3&gt;

&lt;p&gt;If you only rely on &lt;strong&gt;searching for the nearest vector&lt;/strong&gt;, Qdrant will have to go through the entire vector index. It will calculate similarities against each vector in the collection, relevant or not. Alternatively, when you filter with the help of a payload index, the HNSW algorithm won't have to evaluate every point. Furthermore, the payload index helps HNSW construct its graph with additional links.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does the payload index look?
&lt;/h2&gt;

&lt;p&gt;A payload index is similar to an index in a conventional document-oriented database. It connects metadata fields with their corresponding point IDs for quick retrieval. &lt;/p&gt;

&lt;p&gt;In this example, you are indexing all of your computer hardware inside of the &lt;code&gt;computers&lt;/code&gt; collection. Let’s take a look at a sample payload index for the field &lt;code&gt;category&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Index&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyword:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;+------------+-------------+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;category&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;+------------+-------------+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;laptop&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;desktop&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;speakers&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyboard&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;+------------+-------------+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When fields are properly indexed, the search engine knows roughly where to start. It can look up the points that contain relevant metadata without scanning the entire dataset, which greatly reduces its workload. As a result, queries return faster and the system scales easily.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You may create as many payload indexes as you want, and we recommend you do so for each field that is frequently used. &lt;br&gt;
If your users are often filtering by &lt;strong&gt;laptop&lt;/strong&gt; when looking up a product &lt;strong&gt;category&lt;/strong&gt;, indexing all computer metadata will speed up retrieval and make the results more precise.&lt;/p&gt;
&lt;/blockquote&gt;
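
&lt;p&gt;With the &lt;code&gt;category&lt;/code&gt; field indexed, such a filtered search over the &lt;code&gt;computers&lt;/code&gt; collection could look like this (a minimal sketch; the query vector is a placeholder for a real embedding):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

hits = client.query_points(
    collection_name="computers",
    query=[0.2, 0.8, 0.5, 0.1],  # placeholder for a real query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="category", match=models.MatchValue(value="laptop"))]
    ),
    limit=5,
).points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;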

&lt;h3&gt;
  
  
  Different types of payload indexes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#full-text-index"&gt;Full-text Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Enables efficient text search in large datasets.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#tenant-index"&gt;Tenant Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For data isolation and retrieval efficiency in multi-tenant architectures.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#principal-index"&gt;Principal Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Manages data based on primary entities like users or accounts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#on-disk-payload-index"&gt;On-Disk Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Stores indexes on disk to support large datasets without keeping the index in RAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/documentation/concepts/indexing/#parameterized-index"&gt;Parameterized Index&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Allows for dynamic querying, where the index can adapt based on different parameters or conditions provided by the user. Useful for numeric data like prices or timestamps.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Indexing payloads in multitenant setups
&lt;/h2&gt;

&lt;p&gt;Some applications need their data segregated: different users must see different data inside the same application. When designing storage for such a system, many assume they need a separate database per group of users.&lt;/p&gt;

&lt;p&gt;We see this quite often: users create a separate collection for each tenant inside the same cluster, which can quickly exhaust the cluster’s resources. Running vector search across too many collections consumes excessive RAM, and you may start seeing out-of-memory (OOM) errors and degraded performance. &lt;/p&gt;

&lt;p&gt;To mitigate this, we offer extensive support for multitenant systems, so that you can build an entire global application in one single Qdrant collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT /collections/{collection_name}/index
{
   "field_name": "payload_field_name",
   "field_schema": {
       "type": "keyword",
       "is_tenant": true
   }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tenant index is another variant of the payload index. When creating or updating a collection, you mark a metadata field as indexable - but this time the request specifies the field as a tenant. You can mark fields such as user types or customer IDs with &lt;code&gt;is_tenant: true&lt;/code&gt;. &lt;/p&gt;
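
&lt;p&gt;Here is how the same tenant index could be created and used with the Python client - a sketch in which &lt;code&gt;my_app&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt; are an illustrative collection and tenant field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Mark the field as the tenant key when indexing it
client.create_payload_index(
    collection_name="my_app",
    field_name="user_id",
    field_schema=models.KeywordIndexParams(
        type=models.KeywordIndexType.KEYWORD,
        is_tenant=True,
    ),
)

# Every query then filters by the tenant, so each user only sees their own data
client.query_points(
    collection_name="my_app",
    query=[0.2, 0.1, 0.9, 0.7],  # placeholder for a real query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="user_id", match=models.MatchValue(value="user_42"))]
    ),
    limit=10,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;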

&lt;h2&gt;
  
  
  Key takeaways in filtering and indexing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvghqy7yhnib2rs2znau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvghqy7yhnib2rs2znau.png" alt="best-practices" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
Filtering with floating-point (decimal) numbers
&lt;/h2&gt;

&lt;p&gt;If you filter on the float data type with exact matches, your results can be imprecise and unreliable. &lt;/p&gt;

&lt;p&gt;Numbers of the float datatype have a decimal point and are 64 bits in size. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.99&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you filter for a specific float value, such as 11.99, you may get a different result, like 11.98 or 12.00. Decimal values are rounded and stored in binary floating point, so logically identical values can end up with slightly different representations. Searching for exact matches is therefore unreliable.&lt;/p&gt;

&lt;p&gt;To avoid these inaccuracies, use a different filtering method. We recommend range-based filtering instead of exact matches. Ranges tolerate minor variations in the data, and they boost performance - especially with large datasets. &lt;/p&gt;

&lt;p&gt;Here is a sample JSON range filter with both bounds set to 11.99 (greater than or equal to, and less than or equal to, the same number). It retrieves any value that falls within this range, including floating-point representations with additional decimal places.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"gt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"gte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"lte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.99&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
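
&lt;p&gt;The same range condition, expressed with the Python client, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

price_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="price",
            # Both bounds set to 11.99, mirroring the JSON filter above
            range=models.Range(gte=11.99, lte=11.99),
        )
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;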



&lt;h2&gt;
  
  
  Working with pagination in queries
&lt;/h2&gt;

&lt;p&gt;When you're implementing pagination in filtered queries, indexing becomes even more critical. When paginating results, you often need to exclude items you've already seen. This is typically managed by applying filters that specify which IDs should not be included in the next set of results. &lt;/p&gt;

&lt;p&gt;However, an interesting aspect of Qdrant's data model is that a single point can have multiple values for the same field, such as different color options for a product. This means that during filtering, an ID might appear multiple times if it matches on different values of the same field. &lt;/p&gt;

&lt;p&gt;Proper indexing ensures that these queries are efficient, preventing duplicate results and making pagination smoother.&lt;/p&gt;
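
&lt;p&gt;A common pattern is cursor-based pagination with the scroll API: Qdrant returns the offset of the next page alongside each batch, so you never re-fetch points you have already seen. A minimal sketch with illustrative names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

next_offset = None
while True:
    points, next_offset = client.scroll(
        collection_name="computers",
        scroll_filter=models.Filter(
            must=[models.FieldCondition(key="category", match=models.MatchValue(value="laptop"))]
        ),
        limit=100,
        offset=next_offset,  # continue exactly where the previous page ended
    )
    for point in points:
        ...  # process the current page
    if next_offset is None:
        break  # no more pages left
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;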

&lt;h2&gt;
  
  
  Conclusion: Real-life use cases of filtering
&lt;/h2&gt;

&lt;p&gt;Filtering in a &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;vector database&lt;/a&gt; like Qdrant can significantly enhance search capabilities by enabling more precise and efficient retrieval of data. &lt;/p&gt;

&lt;p&gt;As a conclusion to this guide, let's look at some real-life use cases where filtering is crucial:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Vector Search&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/advanced-search/"&gt;E-Commerce Product Search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Search for products by style or visual similarity&lt;/td&gt;
&lt;td&gt;Filter by price, color, brand, size, ratings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/recommendations/"&gt;Recommendation Systems&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Recommend similar content (e.g., movies, songs)&lt;/td&gt;
&lt;td&gt;Filter by release date, genre, etc. (e.g., movies after 2020)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/articles/geo-polygon-filter-gsoc/"&gt;Geospatial Search in Ride-Sharing&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Find similar drivers or delivery partners&lt;/td&gt;
&lt;td&gt;Filter by rating, distance radius, vehicle type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/data-analysis-anomaly-detection/"&gt;Fraud &amp;amp; Anomaly Detection&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Detect transactions similar to known fraud cases&lt;/td&gt;
&lt;td&gt;Filter by amount, time, location&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Before you go - all the code is in Qdrant's Dashboard
&lt;/h3&gt;

&lt;p&gt;The easiest way to reach that "Hello World" moment is to &lt;a href="https://dev.to/documentation/quickstart-cloud/"&gt;&lt;strong&gt;try filtering in a live cluster&lt;/strong&gt;&lt;/a&gt;. Our interactive tutorial will show you how to create a cluster, add data and try some filtering clauses. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance!</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Thu, 29 Aug 2024 15:59:51 +0000</pubDate>
      <link>https://dev.to/qdrant/any-embedding-model-can-become-a-late-interaction-model-if-you-give-it-a-chance-3iip</link>
      <guid>https://dev.to/qdrant/any-embedding-model-can-become-a-late-interaction-model-if-you-give-it-a-chance-3iip</guid>
      <description>&lt;p&gt;* At least any open-source model, since you need access to its internals.&lt;/p&gt;

&lt;h3&gt;
  
  
  You Can Adapt Dense Embedding Models for Late Interaction
&lt;/h3&gt;

&lt;p&gt;Qdrant 1.10 introduced support for multi-vector representations, with late interaction being a prominent example of this approach. In essence, both documents and queries are represented by multiple vectors, and identifying the most relevant documents involves calculating a score based on the similarity between corresponding query and document embeddings. If you're not familiar with this paradigm, our updated &lt;a href="https://dev.to/articles/hybrid-search/"&gt;Hybrid Search&lt;/a&gt; article explains how multi-vector representations can enhance retrieval quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; We can visualize late interaction between corresponding document-query embedding pairs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3zfr1tmzptr7ra0sjaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3zfr1tmzptr7ra0sjaz.png" alt="Late interaction" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many specialized late interaction models, such as ColBERT, but &lt;strong&gt;it appears that regular dense embedding models can also be effectively utilized in this manner&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this study, we will demonstrate that standard dense embedding models, traditionally used for single-vector representations, can be effectively adapted for late interaction scenarios using output token embeddings as multi-vector representations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By testing out retrieval with Qdrant’s multi-vector feature, we will show that these models can rival or surpass specialized late interaction models in retrieval performance, while offering lower complexity and greater efficiency. This work redefines the potential of dense models in advanced search pipelines, presenting a new method for optimizing retrieval systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Embedding Models
&lt;/h2&gt;

&lt;p&gt;The inner workings of embedding models might be surprising to some. The model doesn’t operate directly on the input text; instead, it requires a tokenization step to convert the text into a sequence of token identifiers. Each token identifier is then passed through an embedding layer, which transforms it into a dense vector. Essentially, the embedding layer acts as a lookup table that maps token identifiers to dense vectors. These vectors are then fed into the transformer model as input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; The tokenization step, which takes place before vectors are added to the transformer model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81uxh0sz0fagvusxibqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81uxh0sz0fagvusxibqu.png" alt="Input token embeddings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The input token embeddings are context-free and are learned during the model’s training process. This means that each token always receives the same embedding, regardless of its position in the text. At this stage, the token embeddings are unaware of the context in which they appear. It is the transformer model’s role to contextualize these embeddings.&lt;/p&gt;

&lt;p&gt;Much has been discussed about the role of attention in transformer models, but in essence, this mechanism is responsible for capturing cross-token relationships. Each transformer module takes a sequence of token embeddings as input and produces a sequence of output token embeddings. Both sequences are of the same length, with each token embedding being enriched by information from the other token embeddings at the current step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; The mechanism which produces a sequence of output token embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg5yjc55utu5fn30n0zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg5yjc55utu5fn30n0zx.png" alt="Output token embeddings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; The final step performed by the embedding model is pooling the output token embeddings to generate a single vector representation of the input text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feop04kiezatxax2jsk3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feop04kiezatxax2jsk3f.png" alt="Pooling" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are several pooling strategies, but regardless of which one a model uses, the output is always a single vector representation, which inevitably loses some information about the input. It’s akin to giving someone detailed, step-by-step directions to the nearest grocery store versus simply pointing in the general direction. While the vague direction might suffice in some cases, the detailed instructions are more likely to lead to the desired outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Output Token Embeddings for Multi-Vector Representations
&lt;/h3&gt;

&lt;p&gt;We often overlook the output token embeddings, but the fact is—they also serve as multi-vector representations of the input text. So, why not explore their use in a multi-vector retrieval model, similar to late interaction models?&lt;/p&gt;
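
&lt;p&gt;With sentence-transformers, both representations are available from the same model; a minimal sketch (shapes shown for &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pooled representation: a single 384-dimensional vector per text
pooled = model.encode("A sample sentence")
print(pooled.shape)  # (384,)

# Output token embeddings: one 384-dimensional vector per token
tokens = model.encode("A sample sentence", output_value="token_embeddings")
print(tokens.shape)  # (number_of_tokens, 384)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;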

&lt;h4&gt;
  
  
  Experimental Findings
&lt;/h4&gt;

&lt;p&gt;We conducted several experiments to determine whether output token embeddings could be effectively used in place of traditional late interaction models. The results are quite promising.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Experiment&lt;/th&gt;
            &lt;th&gt;NDCG@10&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;SciFact&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.70928&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.69579&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.64508&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.70724&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.68213&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.73696&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;NFCorpus&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.34166&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.35036&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.31594&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.35779&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.29696&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.37502&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="6"&gt;ArguAna&lt;/th&gt;
            &lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;sparse vectors&lt;/td&gt;
            &lt;td&gt;0.47271&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;late interaction model&lt;/td&gt;
            &lt;td&gt;0.44534&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;0.50167&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.45997&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;single dense vector representation&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.58857&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.57648&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_all_exact.py" rel="noopener noreferrer"&gt;source code for these experiments is open-source&lt;/a&gt; and utilizes &lt;a href="https://github.com/kacperlukawski/beir-qdrant" rel="noopener noreferrer"&gt;&lt;code&gt;beir-qdrant&lt;/code&gt;&lt;/a&gt;, an integration of Qdrant with the &lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BeIR library&lt;/a&gt;. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.&lt;/p&gt;

&lt;p&gt;Even the simple &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model can be applied in a late interaction model fashion, resulting in a positive impact on retrieval quality. However, the best results were achieved with the &lt;code&gt;BAAI/bge-small-en&lt;/code&gt; model, which outperformed both sparse and late interaction models.&lt;/p&gt;

&lt;p&gt;It's important to note that ColBERT has not been trained on BeIR datasets, making its performance fully out-of-domain. Nevertheless, the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data" rel="noopener noreferrer"&gt;training dataset&lt;/a&gt; also lacks any BeIR data, yet it still performs remarkably well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparative Analysis of Dense vs. Late Interaction Models
&lt;/h3&gt;

&lt;p&gt;The retrieval quality speaks for itself, but there are other important factors to consider.&lt;/p&gt;

&lt;p&gt;The traditional dense embedding models we tested are less complex than late interaction or sparse models. With fewer parameters, these models are expected to be faster during inference and more cost-effective to maintain. Below is a comparison of the models used in the experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Number of parameters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prithivida/Splade_PP_en_v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;109,514,298&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;colbert-ir/colbertv2.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;109,580,544&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;33,360,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;22,713,216&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One argument against using output token embeddings is the increased storage requirements compared to ColBERT-like models. For instance, the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model produces 384-dimensional output token embeddings, which is three times more than the 128-dimensional embeddings generated by ColBERT-like models. This increase not only leads to higher memory usage but also impacts the computational cost of retrieval, as calculating distances takes more time. Mitigating this issue through vector compression would make a lot of sense.&lt;/p&gt;
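
&lt;p&gt;To put rough numbers on the overhead (assuming uncompressed float32 vectors, i.e. 4 bytes per dimension):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-token storage for uncompressed float32 embeddings
minilm_bytes = 384 * 4   # 1536 bytes per all-MiniLM-L6-v2 token embedding
colbert_bytes = 128 * 4  # 512 bytes per ColBERT token embedding
print(minilm_bytes / colbert_bytes)  # 3.0 - three times the memory per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;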

&lt;h4&gt;
  
  
  Exploring Quantization for Multi-Vector Representations
&lt;/h4&gt;

&lt;p&gt;Binary quantization is generally more effective for high-dimensional vectors, making the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model, with its relatively low-dimensional outputs, less ideal for this approach. However, scalar quantization appeared to be a viable alternative. The table below summarizes the impact of quantization on retrieval quality.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Experiment&lt;/th&gt;
            &lt;th&gt;NDCG@10&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;SciFact&lt;/th&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.70724&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings (uint8)&lt;/td&gt;
            &lt;td&gt;0.70297&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan="4"&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;NFCorpus&lt;/th&gt;
            &lt;td rowspan="2"&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
            &lt;td&gt;output token embeddings&lt;/td&gt;
            &lt;td&gt;0.35779&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;output token embeddings (uint8)&lt;/td&gt;
            &lt;td&gt;0.35572&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s important to note that quantization doesn’t always preserve retrieval quality at the same level, but in this case, scalar quantization appears to have minimal impact on retrieval performance. The effect is negligible, while the memory savings are substantial.&lt;/p&gt;

&lt;p&gt;We managed to maintain the original quality while using a quarter of the memory. Additionally, a quantized &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; token embedding requires 384 bytes (384 dimensions at one byte each), compared to ColBERT’s 512 bytes (128 float32 dimensions at four bytes each). This results in a 25% reduction in memory usage, with retrieval quality remaining nearly unchanged.&lt;/p&gt;
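
&lt;p&gt;Enabling scalar quantization in Qdrant takes a single extra parameter in the collection configuration. A minimal sketch, mirroring the multi-vector setup used in these experiments (the collection name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="quantized-tokens",
    vectors_config=models.VectorParams(
        size=384,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
    # Store each dimension as a single byte instead of a 4-byte float
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;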

&lt;h3&gt;
  
  
  Practical Application: Enhancing Retrieval with Dense Models
&lt;/h3&gt;

&lt;p&gt;If you’re using one of the sentence transformer models, the output token embeddings are calculated by default. While a single vector representation is more efficient in terms of storage and computation, there’s no need to discard the output token embeddings. According to our experiments, these embeddings can significantly enhance retrieval quality. You can store both the single vector and the output token embeddings in Qdrant, using the single vector for the initial retrieval step and then reranking the results with the output token embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; A single model pipeline that relies solely on the output token embeddings for reranking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosc6fs0fez7z7uvt6m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosc6fs0fez7z7uvt6m9.png" alt="Single model reranking" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we implemented a simple reranking pipeline in Qdrant. This pipeline uses a dense embedding model for the initial oversampled retrieval and then relies solely on the output token embeddings for the reranking step.&lt;/p&gt;

&lt;h4&gt;
  
  
  Single Model Retrieval and Reranking Benchmarks
&lt;/h4&gt;

&lt;p&gt;Our tests focused on using the same model for both retrieval and reranking. The reported metric is NDCG@10. In all tests, we applied an oversampling factor of 5x, meaning the retrieval step returned 50 results, which were then narrowed down to 10 during the reranking step. Below are the results for some of the BeIR datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th rowspan="2"&gt;Dataset&lt;/th&gt;
            &lt;th colspan="2"&gt;&lt;code&gt;all-miniLM-L6-v2&lt;/code&gt;&lt;/th&gt;
            &lt;th colspan="2"&gt;&lt;code&gt;BAAI/bge-small-en&lt;/code&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;dense embeddings only&lt;/th&gt;
            &lt;th&gt;dense + reranking&lt;/th&gt;
            &lt;th&gt;dense embeddings only&lt;/th&gt;
            &lt;th&gt;dense + reranking&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;th&gt;SciFact&lt;/th&gt;
            &lt;td&gt;0.64508&lt;/td&gt;
            &lt;td&gt;0.70293&lt;/td&gt;
            &lt;td&gt;0.68213&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.73053&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;NFCorpus&lt;/th&gt;
            &lt;td&gt;0.31594&lt;/td&gt;
            &lt;td&gt;0.34297&lt;/td&gt;
            &lt;td&gt;0.29696&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.35996&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;ArguAna&lt;/th&gt;
            &lt;td&gt;0.50167&lt;/td&gt;
            &lt;td&gt;0.45378&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.58857&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.57302&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;Touche-2020&lt;/th&gt;
            &lt;td&gt;0.16904&lt;/td&gt;
            &lt;td&gt;0.19693&lt;/td&gt;
            &lt;td&gt;0.13055&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.19821&lt;/u&gt;&lt;/td&gt;        
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;TREC-COVID&lt;/th&gt;
            &lt;td&gt;0.47246&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.6379&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.45788&lt;/td&gt;
            &lt;td&gt;0.53539&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th&gt;FiQA-2018&lt;/th&gt;
            &lt;td&gt;0.36867&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;0.41587&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;0.31091&lt;/td&gt;
            &lt;td&gt;0.39067&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The source code for the benchmark is publicly available, and &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_reranking.py" rel="noopener noreferrer"&gt;you can find it in the repository of the &lt;code&gt;beir-qdrant&lt;/code&gt; package&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Overall, adding a reranking step using the same model typically improves retrieval quality. However, the quality of various late interaction models is &lt;a href="https://huggingface.co/mixedbread-ai/mxbai-colbert-large-v1#1-reranking-performance" rel="noopener noreferrer"&gt;often reported based on their reranking performance when BM25 is used for the initial retrieval&lt;/a&gt;. This experiment aimed to demonstrate how a single model can be effectively used for both retrieval and reranking, and the results are quite promising.&lt;/p&gt;

&lt;p&gt;Now, let's explore how to implement this using the new Query API introduced in Qdrant 1.10.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guide: Setting Up Qdrant for Late Interaction
&lt;/h3&gt;

&lt;p&gt;The new Query API in Qdrant 1.10 enables the construction of even more complex retrieval pipelines. We can use the single vector created after pooling for the initial retrieval step and then rerank the results using the output token embeddings.&lt;/p&gt;

&lt;p&gt;Assuming the collection is named &lt;code&gt;my-collection&lt;/code&gt; and is configured to store two named vectors: &lt;code&gt;dense-vector&lt;/code&gt; and &lt;code&gt;output-token-embeddings&lt;/code&gt;, here’s how such a collection could be created in Qdrant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-token-embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;multivector_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiVectorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;comparator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MultiVectorComparator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_SIM&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both vectors are of the same size since they are produced by the same &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
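
&lt;p&gt;Ingesting a document then means computing both representations with that one model and storing them under the two named vectors. A sketch, reusing the &lt;code&gt;client&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; created above (the document text and id are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import models

document = "Qdrant 1.10 introduced support for multi-vector representations."

client.upsert(
    collection_name="my-collection",
    points=[
        models.PointStruct(
            id=1,
            vector={
                # Pooled vector for the initial retrieval step
                "dense-vector": model.encode(document).tolist(),
                # Token embeddings (a list of vectors) for reranking
                "output-token-embeddings": model.encode(
                    document, output_value="token_embeddings"
                ).tolist(),
            },
            payload={"text": document},
        )
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;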



&lt;p&gt;Now, instead of using the search API with just a single dense vector, we can create a reranking pipeline. First, we retrieve 50 results using the dense vector, and then we rerank them using the output token embeddings to obtain the top 10 results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What else can be done with just all-MiniLM-L6-v2 model?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="c1"&gt;# Prefetch the dense embeddings of the top-50 documents
&lt;/span&gt;        &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Rerank the top-50 documents retrieved by the dense embedding model
&lt;/span&gt;    &lt;span class="c1"&gt;# and return just the top-10. Please note we call the same model, but
&lt;/span&gt;    &lt;span class="c1"&gt;# we ask for the token embeddings by setting the output_value parameter.
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-token-embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Try the Experiment Yourself
&lt;/h3&gt;

&lt;p&gt;In a real-world scenario, you might take it a step further by first calculating the token embeddings and then performing pooling to obtain the single vector representation. This approach allows you to complete everything in a single pass.&lt;/p&gt;
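
&lt;p&gt;Here is a minimal sketch of that single-pass idea. It assumes the model’s default pooling setup: &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; builds its sentence vector by mean-pooling the token embeddings and L2-normalizing the result, so the dense vector can be derived from the same forward pass instead of encoding twice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text = "What else can be done with just all-MiniLM-L6-v2 model?"

# Single forward pass: request only the token embeddings.
token_embeddings = model.encode(text, output_value="token_embeddings")
token_embeddings = token_embeddings.cpu().numpy()  # shape: (num_tokens, 384)

# Mean-pool and L2-normalize to approximate the model's own sentence vector.
dense_vector = token_embeddings.mean(axis=0)
dense_vector = dense_vector / np.linalg.norm(dense_vector)

# Both `dense_vector` and `token_embeddings` can now be upserted into the
# two named vectors of the collection defined above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;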

&lt;p&gt;The simplest way to start experimenting with building complex reranking pipelines in Qdrant is by using the forever-free cluster on &lt;a href="https://cloud.qdrant.io/" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt; and reading &lt;a href="https://qdrant.tech/documentation/"&gt;Qdrant's documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kacperlukawski/beir-qdrant/blob/main/examples/retrieval/search/evaluate_all_exact.py" rel="noopener noreferrer"&gt;source code for these experiments is open-source&lt;/a&gt; and uses &lt;a href="https://github.com/kacperlukawski/beir-qdrant" rel="noopener noreferrer"&gt;&lt;code&gt;beir-qdrant&lt;/code&gt;&lt;/a&gt;, an integration of Qdrant with the &lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BeIR library&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Directions and Research Opportunities
&lt;/h3&gt;

&lt;p&gt;The initial experiments using output token embeddings in the retrieval process have yielded promising results. However, we plan to conduct further benchmarks to validate these findings and explore the incorporation of sparse methods for the initial retrieval. Additionally, we aim to investigate the impact of quantization on multi-vector representations and its effects on retrieval quality. Finally, we will assess retrieval speed, a crucial factor for many applications.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Optimizing RAG Through an Evaluation-Based Methodology</title>
      <dc:creator>Atita Arora</dc:creator>
      <pubDate>Wed, 03 Jul 2024 13:18:22 +0000</pubDate>
      <link>https://dev.to/qdrant/optimizing-rag-through-an-evaluation-based-methodology-45pj</link>
      <guid>https://dev.to/qdrant/optimizing-rag-through-an-evaluation-based-methodology-45pj</guid>
      <description>&lt;p&gt;In today's fast-paced, information-rich world, AI is revolutionizing knowledge management. The systematic process of capturing, distributing, and effectively using knowledge within an organization is one of the fields in which AI provides exceptional value today. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The potential for AI-powered knowledge management increases when leveraging Retrieval Augmented Generation (RAG), a methodology that enables LLMs to access a vast, diverse repository of factual information from knowledge stores, such as vector databases. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This process enhances the accuracy, relevance, and reliability of generated text, thereby mitigating the risk of faulty, incorrect, or nonsensical results sometimes associated with traditional LLMs. This method not only ensures that the answers are contextually relevant but also up-to-date, reflecting the latest insights and data available.&lt;/p&gt;

&lt;p&gt;While RAG enhances the accuracy, relevance, and reliability of traditional LLM solutions, &lt;strong&gt;an evaluation strategy can further help teams ensure their AI products meet these benchmarks of success.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Relevant tools for this experiment
&lt;/h2&gt;

&lt;p&gt;In this article, we’ll break down a RAG optimization workflow experiment that demonstrates that evaluation is essential to building a successful RAG strategy. We will use Qdrant and Quotient for this experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/"&gt;Qdrant&lt;/a&gt; is a vector database and vector similarity search engine designed for efficient storage and retrieval of high-dimensional vectors. Because Qdrant offers efficient indexing and searching capabilities, it is ideal for implementing RAG solutions, where quickly and accurately retrieving relevant information from extremely large datasets is crucial. Qdrant also offers a wealth of additional features, such as quantization, multivector support and multi-tenancy.&lt;/p&gt;

&lt;p&gt;Alongside Qdrant, we will use Quotient, which provides a seamless way to evaluate your RAG implementation, accelerating and improving the experimentation process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.quotientai.co/"&gt;Quotient&lt;/a&gt; is a platform that provides tooling for AI developers to build evaluation frameworks and conduct experiments on their products. Evaluation is how teams surface the shortcomings of their applications and improve performance in key benchmarks such as faithfulness, and semantic similarity. Iteration is key to building innovative AI products that will deliver value to end users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The &lt;a href="https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-quotient"&gt;accompanying notebook&lt;/a&gt; for this exercise can be found on GitHub for future reference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Summary of key findings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevance and Hallucinations&lt;/strong&gt;: When the documents retrieved are irrelevant, evidenced by low scores in both Chunk Relevance and Context Relevance, the model is prone to generating inaccurate or fabricated information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing Document Retrieval&lt;/strong&gt;: By retrieving a greater number of documents and reducing the chunk size, we observed improved outcomes in the model's performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Retrieval Needs&lt;/strong&gt;: Certain queries may benefit from accessing more documents. Implementing a dynamic retrieval strategy that adjusts based on the query could enhance accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Influence of Model and Prompt Variations&lt;/strong&gt;: Alterations in language models or the prompts used can significantly impact the quality of the generated responses, suggesting that fine-tuning these elements could optimize performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let us walk you through how we arrived at these findings!&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a RAG pipeline
&lt;/h2&gt;

&lt;p&gt;To evaluate a RAG pipeline, we first have to build one. In the interest of simplicity, we are building a naive RAG in this article. There are certainly other variants of RAG:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ulnki54g5m4xmism86x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ulnki54g5m4xmism86x.png" alt="shades_of_rag.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The illustration below depicts how we can leverage a RAG evaluation framework to assess the quality of a RAG application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpsymcileyqa6fxwajdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpsymcileyqa6fxwajdy.png" alt="qdrant_and_quotient.png" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are going to build a RAG application using Qdrant’s documentation and the prebuilt &lt;a href="https://huggingface.co/datasets/atitaarora/qdrant_doc"&gt;Hugging Face dataset&lt;/a&gt;.&lt;br&gt;
We will then assess our RAG application’s ability to answer questions about Qdrant.&lt;/p&gt;

&lt;p&gt;To prepare our knowledge store we will use Qdrant, which can be connected to in three different ways, as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;##Uncomment to initialise qdrant client in memory
#client = qdrant_client.QdrantClient(
#    location=":memory:",
#)
&lt;/span&gt;
&lt;span class="c1"&gt;##Uncomment below to connect to Qdrant Cloud
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Uncomment below to connect to local Qdrant
#client = qdrant_client.QdrantClient("http://localhost:6333")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will be using &lt;a href="https://cloud.qdrant.io/login"&gt;Qdrant Cloud&lt;/a&gt; so it is a good idea to provide the &lt;code&gt;QDRANT_URL&lt;/code&gt; and &lt;code&gt;QDRANT_API_KEY&lt;/code&gt; as environment variables for easier access.&lt;/p&gt;
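
&lt;p&gt;For example (the values below are placeholders, not real credentials):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Placeholders - replace with your own Qdrant Cloud cluster URL and API key.
os.environ["QDRANT_URL"] = "https://YOUR-CLUSTER-ID.cloud.qdrant.io"
os.environ["QDRANT_API_KEY"] = "YOUR_API_KEY"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;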

&lt;p&gt;Moving on, we will need to define the collection name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant-docs-quotient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, we may need to create different collections based on the experiments we conduct.&lt;/p&gt;

&lt;p&gt;To create embeddings seamlessly throughout the experiment, we will use Qdrant’s native embedding provider &lt;a href="https://qdrant.github.io/fastembed/"&gt;Fastembed&lt;/a&gt;, which supports &lt;a href="https://qdrant.github.io/fastembed/examples/Supported_Models/"&gt;many different models&lt;/a&gt;, including both dense and sparse vector models.&lt;/p&gt;

&lt;p&gt;We can initialize and switch the embedding model of our choice as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Declaring the intended Embedding Model with Fastembed
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastembed.embedding&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextEmbedding&lt;/span&gt;

&lt;span class="c1"&gt;## General Fastembed specific operations
## Initialising the embedding model
## Using Default Model - BAAI/bge-small-en-v1.5
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextEmbedding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;## For custom model supported by Fastembed
#embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)
#embedding_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2", max_length=384)
&lt;/span&gt;
&lt;span class="c1"&gt;## Verify the chosen Embedding model
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
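
&lt;p&gt;As a quick sanity check (a small sketch, not part of the original notebook): &lt;code&gt;embed&lt;/code&gt; returns a generator of numpy vectors, and the default &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; model produces 384-dimensional embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Embed a sample text and inspect the vector dimensionality.
sample_vectors = list(embedding_model.embed(["What is Qdrant?"]))
print(len(sample_vectors[0]))

#Outputs
#384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;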



&lt;p&gt;Before implementing RAG, we need to prepare and index our data in Qdrant.&lt;/p&gt;

&lt;p&gt;This involves converting textual data into vectors using a suitable encoder (e.g., sentence transformers), and storing these vectors in Qdrant for retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.docstore.document&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;LangchainDocument&lt;/span&gt;

&lt;span class="c1"&gt;## Load the dataset with qdrant documentation
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;atitaarora/qdrant_doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Dataset to langchain document
&lt;/span&gt;&lt;span class="n"&gt;langchain_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;LangchainDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;langchain_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#240
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can preview documents in the dataset as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Here's an example of what a document in our dataset looks like
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluation dataset
&lt;/h2&gt;

&lt;p&gt;To measure the quality of our RAG setup, we will need a representative evaluation dataset. This dataset should contain realistic questions and the expected answers.&lt;/p&gt;

&lt;p&gt;Additionally, including the expected contexts for which your RAG pipeline is designed to retrieve information would be beneficial.&lt;/p&gt;

&lt;p&gt;We will be using a &lt;a href="https://huggingface.co/datasets/atitaarora/qdrant_doc_qna"&gt;prebuilt evaluation dataset&lt;/a&gt;. &lt;/p&gt;
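
&lt;p&gt;As a sketch of how to load it (the &lt;code&gt;input_text&lt;/code&gt; column name is an assumption taken from the &lt;code&gt;run_eval&lt;/code&gt; helper defined later, which reads the questions from that column):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datasets import load_dataset

# Load the prebuilt evaluation dataset (questions with ground-truth answers).
eval_dataset = load_dataset("atitaarora/qdrant_doc_qna", split="train")
eval_df = eval_dataset.to_pandas()

# run_eval (defined below) expects the questions in an 'input_text' column.
print(eval_df.columns.tolist())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;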

&lt;p&gt;If you are struggling to create an evaluation dataset for your use case, you can use your own documents and the techniques described in this &lt;a href="https://github.com/qdrant/qdrant-rag-eval/blob/master/synthetic_qna/notebook/Synthetic_question_generation.ipynb"&gt;notebook&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the RAG pipeline
&lt;/h3&gt;

&lt;p&gt;We establish the data preprocessing parameters essential for the RAG pipeline and configure the Qdrant vector database according to the specified criteria. &lt;/p&gt;

&lt;p&gt;Key parameters under consideration are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chunk size&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embedding model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of documents retrieved (retrieval window)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Following the ingestion of data into Qdrant, we proceed to retrieve the pertinent documents corresponding to each query. These documents are then added to our evaluation dataset, populating the designated &lt;strong&gt;&lt;code&gt;context&lt;/code&gt;&lt;/strong&gt; column required for evaluation.&lt;/p&gt;

&lt;p&gt;Next, we define two helper methods: one for adding documents to Qdrant&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function adds documents to the desired Qdrant collection given the specified RAG parameters.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;## Processing each document with desired TEXT_SPLITTER_ALGO, CHUNK_SIZE, CHUNK_OVERLAP
&lt;/span&gt;    &lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;add_start_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;docs_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;langchain_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;docs_processed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;## Processing documents to be encoded by Fastembed
&lt;/span&gt;    &lt;span class="n"&gt;docs_contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;docs_metadatas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs_processed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;docs_contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;docs_metadatas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Handle the case where attributes are missing
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warning: Some documents do not have &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; attributes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_processed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_contents&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_metadatas&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;## Adding documents to Qdrant using desired embedding model
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs_metadatas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs_contents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and one for retrieving documents from Qdrant during our RAG pipeline assessment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function retrieves the desired number of documents from the Qdrant collection given a query.
    It returns a list of the retrieved documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
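
&lt;p&gt;A quick illustrative call (the question string is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;docs = get_documents(COLLECTION_NAME, "How do I create a collection in Qdrant?", num_documents=3)

for doc in docs:
    print(doc[:80])  # preview the first 80 characters of each retrieved chunk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;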



&lt;h3&gt;
  
  
  Setting up Quotient
&lt;/h3&gt;

&lt;p&gt;You will need an account to log in, which you can get by requesting access on &lt;a href="https://www.quotientai.co/"&gt;Quotient's website&lt;/a&gt;. Once you have an account, you can create an API key by running the &lt;code&gt;quotient authenticate&lt;/code&gt; CLI command.&lt;/p&gt;

&lt;p&gt;💡 Be sure to save your API key, since it will only be displayed once (Note: you will not have to repeat this step again until your API key expires).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once you have your API key, make sure to set it as an environment variable called &lt;code&gt;QUOTIENT_API_KEY&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import QuotientAI client and connect to QuotientAI
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;quotientai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuotientClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;quotientai.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;show_job_progress&lt;/span&gt;

&lt;span class="c1"&gt;# IMPORTANT: be sure to set your API key as an environment variable called QUOTIENT_API_KEY
# You will need this set before running the code below. You may also uncomment the following line and insert your API key:
# os.environ['QUOTIENT_API_KEY'] = "YOUR_API_KEY"
&lt;/span&gt;
&lt;span class="n"&gt;quotient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QuotientClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;QuotientAI&lt;/strong&gt; provides a seamless way to integrate &lt;em&gt;RAG evaluation&lt;/em&gt; into your applications. Here, we'll see how to use it to evaluate text generated from an LLM, based on retrieved knowledge from the Qdrant vector database.&lt;/p&gt;

&lt;p&gt;After retrieving the top similar documents and populating the &lt;code&gt;context&lt;/code&gt; column, we can submit the evaluation dataset to Quotient and execute an evaluation job. To run a job, all you need is your evaluation dataset and a &lt;code&gt;recipe&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A recipe is a combination of a prompt template and a specified LLM.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quotient&lt;/strong&gt; orchestrates the evaluation run and handles version control and asset management throughout the experimentation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Prior to assessing our RAG solution, it's crucial to outline our optimization goals.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;In the context of &lt;em&gt;question-answering on Qdrant documentation&lt;/em&gt;, our focus extends beyond merely providing helpful responses. Ensuring the absence of any &lt;em&gt;inaccurate or misleading information&lt;/em&gt; is paramount. &lt;/p&gt;

&lt;p&gt;In other words, &lt;strong&gt;we want to minimize hallucinations&lt;/strong&gt; in the LLM outputs.&lt;/p&gt;

&lt;p&gt;For our evaluation, we will be considering the following metrics, with a focus on &lt;strong&gt;Faithfulness&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context Relevance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk Relevance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROUGE-L&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BERT Sentence Similarity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BERTScore&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluation in action
&lt;/h3&gt;

&lt;p&gt;The function below takes an evaluation dataset as input, which in this case contains questions and their corresponding answers. It retrieves relevant documents based on the questions in the dataset and populates the context field with this information from Qdrant. The prepared dataset is then submitted to QuotientAI for evaluation for the chosen metrics. After the evaluation is complete, the function displays aggregated statistics on the evaluation metrics followed by the summarized evaluation results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_dataset_qdrant_questions.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function evaluates the performance of a complete RAG pipeline on a given evaluation dataset.

    Given an evaluation dataset (containing questions and ground truth answers),
    this function retrieves relevant documents, populates the context field, and submits the dataset to QuotientAI for evaluation.
    Once the evaluation is complete, aggregated statistics on the evaluation metrics are displayed.

    The evaluation results are returned as a pandas dataframe.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Add context to each question by retrieving relevant documents
&lt;/span&gt;    &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;get_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                                                &lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Now we'll save the eval_df to a CSV
&lt;/span&gt;    &lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Upload the eval dataset to QuotientAI
&lt;/span&gt;    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant-questions-eval-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a new task for the dataset
&lt;/span&gt;    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qdrant-questions-qa-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question_answering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run a job to evaluate the model
&lt;/span&gt;    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_fewshot_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metric_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Show the progress of the job
&lt;/span&gt;    &lt;span class="nf"&gt;show_job_progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Once the job is complete, we can get our results
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_eval_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Add the results to a pandas dataframe to get statistics on performance
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric|completion_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]]&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion_time_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Completion Time (ms)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Chunk Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selfcheckgpt_nli_relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context Relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selfcheckgpt_nli&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rougeL_fmeasure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ROUGE-L&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bert_score_f1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BERTScore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bert_sentence_similarity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BERT Sentence Similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion_verbosity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion Verbosity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;verbosity_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verbosity Ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,}&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()].&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="n"&gt;main_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Context Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Chunk Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROUGE-L&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BERT Sentence Similarity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BERTScore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
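
&lt;p&gt;For reference, a hypothetical invocation of &lt;code&gt;run_eval&lt;/code&gt; might look like this (the &lt;code&gt;recipe_mistral['id']&lt;/code&gt; argument assumes the recipe created in the next section; the accompanying notebook wires these pieces together):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical invocation - see the accompanying notebook for the exact calls.
results_df = run_eval(eval_df,
                      collection_name=COLLECTION_NAME,
                      recipe_id=recipe_mistral['id'],
                      num_docs=3,
                      path="eval_dataset_qdrant_questions_exp1.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;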



&lt;h2&gt;
  
  
  Experimentation
&lt;/h2&gt;

&lt;p&gt;Our approach is rooted in the belief that improvement thrives in an environment of exploration and discovery. By systematically testing and tweaking various components of the RAG pipeline, we aim to incrementally enhance its capabilities and performance.&lt;/p&gt;

&lt;p&gt;In the following section, we dive into the details of our experimentation process, outlining the specific experiments conducted and the insights gained.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 1 - Baseline
&lt;/h3&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;3&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll process our documents based on the configuration above and ingest them into Qdrant using the &lt;code&gt;add_documents&lt;/code&gt; method introduced earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#experiment1 - base config
&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;
&lt;span class="n"&gt;embedding_model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;num_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#processed: 4504
#content:   4504
#metadata:  4504
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;COLLECTION_NAME&lt;/code&gt;, which helps us segregate and identify our collections based on the experiments conducted.&lt;/p&gt;

&lt;p&gt;To proceed with the evaluation, let’s create the evaluation recipe next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a recipe for the generator model and prompt template
&lt;/span&gt;&lt;span class="n"&gt;recipe_mistral&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_recipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mistral-7b-instruct-qa-with-rag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mistral-7b-instruct using a prompt template that includes context.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;recipe_mistral&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs recipe JSON with the used prompt template
#'prompt_template': {'id': 1,
#  'name': 'Default Question Answering Template',
#  'variables': '["input_text","context"]',
#  'created_at': '2023-12-21T22:01:54.632367',
#  'template_string': 'Question: {input_text}\\n\\nContext: {context}\\n\\nAnswer:',
#  'owner_profile_id': None}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get a list of your existing recipes, you can simply run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_recipes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the recipe uses the simplest possible prompt template: the &lt;code&gt;Question&lt;/code&gt; comes from the evaluation dataset, the &lt;code&gt;Context&lt;/code&gt; from the document chunks retrieved from Qdrant, and the &lt;code&gt;Answer&lt;/code&gt; is generated by the pipeline.&lt;/p&gt;
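
&lt;p&gt;For intuition, here is how that template renders once its variables are filled in; the question and context below are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: rendering the default QA template from the recipe above
template = 'Question: {input_text}\n\nContext: {context}\n\nAnswer:'
print(template.format(
    input_text='How do I create a collection in Qdrant?',
    context='...the top-k document chunks retrieved from Qdrant...',
))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;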

&lt;p&gt;To kick off the evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kick off an evaluation job
&lt;/span&gt;&lt;span class="n"&gt;experiment_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_mistral&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_mistral.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may take a few minutes, depending on the size of the evaluation dataset.&lt;/p&gt;

&lt;p&gt;We can look at the results from our first (baseline) experiment below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frow9lmss3alskdh2qepq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frow9lmss3alskdh2qepq.png" alt="experiment1_eval.png" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we have a pretty &lt;strong&gt;low average Chunk Relevance&lt;/strong&gt; and &lt;strong&gt;very large standard deviations for both Chunk Relevance and Context Relevance&lt;/strong&gt;.&lt;/p&gt;
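
&lt;p&gt;To reproduce these summary statistics, one option is to describe the metric columns of the DataFrame returned by &lt;code&gt;run_eval&lt;/code&gt; (a sketch, assuming the &lt;code&gt;experiment_1&lt;/code&gt; and &lt;code&gt;main_metrics&lt;/code&gt; variables from earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A sketch: mean and spread of the evaluation metrics for the baseline run
display(experiment_1[main_metrics].describe().loc[['mean', 'std']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;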

&lt;p&gt;Let's take a look at some of the lower-performing data points with &lt;strong&gt;poor Faithfulness&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.max_colwidth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;experiment_1&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content.input_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content.answer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content.documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Chunk Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Context Relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9vm573by4ph4e8kc3bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9vm573by4ph4e8kc3bq.png" alt="experiment1_bad_examples.png" width="800" height="766"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In instances where the retrieved documents are &lt;strong&gt;irrelevant (where both Chunk Relevance and Context Relevance are low)&lt;/strong&gt;, the model also shows &lt;strong&gt;tendencies to hallucinate&lt;/strong&gt; and &lt;strong&gt;produce poor quality responses&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The quality of the retrieved text directly impacts the quality of the LLM-generated answer. Therefore, our focus will be on enhancing the RAG setup by &lt;strong&gt;adjusting the chunking parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 2 - Adjusting the chunking parameters
&lt;/h3&gt;

&lt;p&gt;Keeping all other parameters constant, we changed the &lt;code&gt;chunk size&lt;/code&gt; and &lt;code&gt;chunk overlap&lt;/code&gt; to see if we can improve our results.&lt;/p&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;1024&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;128&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;3&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will reprocess the data with the updated parameters above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## for iteration 2 - lets modify chunk configuration
## We will start with creating seperate collection to store vectors
&lt;/span&gt;
&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
&lt;span class="n"&gt;embedding_model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;num_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#processed: 2152
#content:   2152
#metadata:  2152
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Followed by running the evaluation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjiy3g9fc2xhfx32ehdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjiy3g9fc2xhfx32ehdl.png" alt="experiment2_eval.png" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and &lt;strong&gt;comparing it with the results from Experiment 1:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttjhfft3yvcpi5thnj65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttjhfft3yvcpi5thnj65.png" alt="graph_exp1_vs_exp2.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We observed slight enhancements in our LLM completion metrics (including BERT Sentence Similarity, BERTScore, ROUGE-L, and Knowledge F1) with the increase in &lt;em&gt;chunk size&lt;/em&gt;. However, it's noteworthy that there was a significant decrease in &lt;em&gt;Faithfulness&lt;/em&gt;, which is the primary metric we are aiming to optimize.&lt;/p&gt;

&lt;p&gt;Moreover, &lt;em&gt;Context Relevance&lt;/em&gt; demonstrated an increase, indicating that the RAG pipeline retrieved more relevant information required to address the query. Nonetheless, there was a considerable drop in &lt;em&gt;Chunk Relevance&lt;/em&gt;, implying that a smaller portion of the retrieved documents contained pertinent information for answering the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The correlation between the rise in Context Relevance and the decline in Chunk Relevance suggests that retrieving more documents using the smaller chunk size might yield improved results.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 3 - Increasing the number of documents retrieved (retrieval window)
&lt;/h3&gt;

&lt;p&gt;This time, we are using the same RAG setup as &lt;code&gt;Experiment 1&lt;/code&gt;, but increasing the number of retrieved documents from &lt;strong&gt;3&lt;/strong&gt; to &lt;strong&gt;5&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;5&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can use the collection from Experiment 1 and run the evaluation with the modified &lt;code&gt;num_docs&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#collection name from Experiment 1
&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;#running eval for experiment 3
&lt;/span&gt;&lt;span class="n"&gt;experiment_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_mistral&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_mistral.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observe the results below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzgbi7q5wg4p7wvbljvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzgbi7q5wg4p7wvbljvk.png" alt="experiment_3_eval.png" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing the results with Experiments 1 and 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rh1y1rv70jwdminiw4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rh1y1rv70jwdminiw4x.png" alt="graph_exp1_exp2_exp3.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As anticipated, employing the smaller chunk size while retrieving a larger number of documents resulted in achieving the highest levels of both &lt;em&gt;Context Relevance&lt;/em&gt; and &lt;em&gt;Chunk Relevance.&lt;/em&gt; Additionally, it yielded the &lt;strong&gt;best&lt;/strong&gt; (albeit marginal) &lt;em&gt;Faithfulness&lt;/em&gt; score, indicating a &lt;em&gt;reduced occurrence of inaccuracies or hallucinations&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It looks like we have a good handle on our chunking parameters, but it is worth testing another embedding model to see if we can get better results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 4 - Changing the embedding model
&lt;/h3&gt;

&lt;p&gt;Let us try using &lt;strong&gt;MiniLM&lt;/strong&gt; for this experiment.&lt;/p&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;5&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will have to create another collection for this experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#experiment-4
&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;
&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#processed: 4504
#content:   4504
#metadata:  4504
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We observe the evaluation results below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n3jaa7haux468lj7fta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n3jaa7haux468lj7fta.png" alt="experiment4_eval.png" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing these with our previous experiments:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9rrit3pjg4in0owhq84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9rrit3pjg4in0owhq84.png" alt="graph_exp1_exp2_exp3_exp4.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It appears that &lt;code&gt;bge-small&lt;/code&gt; was more proficient in capturing the semantic nuances of the Qdrant Documentation.&lt;/p&gt;

&lt;p&gt;Up to this point, our experimentation has focused solely on the &lt;em&gt;retrieval aspect&lt;/em&gt; of our RAG pipeline. Now, let's explore altering the &lt;em&gt;generation aspect&lt;/em&gt; or LLM while retaining the optimal parameters identified in Experiment 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 5 - Changing the LLM
&lt;/h3&gt;

&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model: &lt;code&gt;bge-small-en&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk size: &lt;code&gt;512&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk overlap: &lt;code&gt;64&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of docs retrieved (Retrieval Window): &lt;code&gt;5&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM: &lt;code&gt;GPT-3.5-turbo&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this, we can repurpose our collection from Experiment 3, while the evaluation uses a new recipe with the &lt;strong&gt;GPT-3.5-turbo&lt;/strong&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#collection name from Experiment 3
&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experiment_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;embedding_model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# We have to create a recipe using the same prompt template and GPT-3.5-turbo
&lt;/span&gt;&lt;span class="n"&gt;recipe_gpt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quotient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_recipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt3.5-qa-with-rag-recipe-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GPT-3.5 using a prompt template that includes context.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;recipe_gpt&lt;/span&gt;

&lt;span class="c1"&gt;#Outputs
#{'id': 495,
# 'name': 'gpt3.5-qa-with-rag-recipe-v1',
# 'description': 'GPT-3.5 using a prompt template that includes context.',
# 'model_id': 5,
# 'prompt_template_id': 1,
# 'created_at': '2024-05-03T12:14:58.779585',
# 'owner_profile_id': 34,
# 'system_prompt_id': None,
# 'prompt_template': {'id': 1,
#  'name': 'Default Question Answering Template',
#  'variables': '["input_text","context"]',
#  'created_at': '2023-12-21T22:01:54.632367',
#  'template_string': 'Question: {input_text}\\n\\nContext: {context}\\n\\nAnswer:',
#  'owner_profile_id': None},
# 'model': {'id': 5,
#  'name': 'gpt-3.5-turbo',
#  'endpoint': 'https://api.openai.com/v1/chat/completions',
#  'revision': 'placeholder',
#  'created_at': '2024-02-06T17:01:21.408454',
#  'model_type': 'OpenAI',
#  'description': 'Returns a maximum of 4K output tokens.',
#  'owner_profile_id': None,
#  'external_model_config_id': None,
#  'instruction_template_cls': 'NoneType'}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the evaluations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;experiment_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;recipe_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe_gpt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_docs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_gpt.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We observe:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka3rvdenj2d7tlnld1yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka3rvdenj2d7tlnld1yu.png" alt="experiment5_eval.png" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and comparing all five experiments below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh8bxpxfpvpo7xep18cu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh8bxpxfpvpo7xep18cu.png" alt="graph_exp1_exp2_exp3_exp4_exp5.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-3.5 surpassed Mistral-7B in all metrics&lt;/strong&gt;! Notably, Experiment 5 exhibited the &lt;strong&gt;lowest occurrence of hallucination&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Let’s take a look at our results from all five experiments:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3p3ddfuy7cgqdb8i4mu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3p3ddfuy7cgqdb8i4mu.png" alt="overall_eval_results.png" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We still have a long way to go in improving the retrieval performance of RAG, as indicated by our generally poor results thus far. It might be beneficial to &lt;strong&gt;explore alternative embedding models&lt;/strong&gt; or &lt;strong&gt;different retrieval strategies&lt;/strong&gt; to address this issue.&lt;/p&gt;

&lt;p&gt;The significant variations in &lt;em&gt;Context Relevance&lt;/em&gt; suggest that &lt;strong&gt;certain questions may necessitate retrieving more documents than others&lt;/strong&gt;. Therefore, investigating a &lt;strong&gt;dynamic retrieval strategy&lt;/strong&gt; could be worthwhile.&lt;/p&gt;
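
&lt;p&gt;One way to prototype such a dynamic window with Qdrant is to cap the number of candidates and filter them by a score threshold instead of using a fixed &lt;code&gt;num_docs&lt;/code&gt; (a sketch, not part of the experiments above; it assumes the chunks were stored under a &lt;code&gt;content&lt;/code&gt; payload key, and the cutoff value is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a dynamic retrieval window: keep only chunks whose similarity
# clears a threshold, up to an upper bound, instead of a fixed num_docs
hits = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,   # the embedded user question
    limit=10,                    # upper bound on the retrieval window
    score_threshold=0.5,         # illustrative cutoff; tune it on your data
)
context = [hit.payload['content'] for hit in hits]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;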

&lt;p&gt;Furthermore, the &lt;strong&gt;generative aspect&lt;/strong&gt; of RAG still needs exploration: modifying LLMs or prompts can substantially impact the overall quality of responses.&lt;/p&gt;
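
&lt;p&gt;Quotient keeps that iteration cheap: register a recipe that pairs the same model with a different prompt template and re-run the evaluation (the &lt;code&gt;prompt_template_id&lt;/code&gt; below is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: same GPT-3.5 model, a different (hypothetical) prompt template
recipe_alt_prompt = quotient.create_recipe(
    model_id=5,
    prompt_template_id=2,  # hypothetical ID; inspect your available templates first
    name='gpt3.5-qa-with-rag-alt-prompt',
    description='GPT-3.5 with an alternative RAG prompt template.',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;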

&lt;p&gt;This iterative process demonstrates how, starting from scratch, continual evaluation and adjustments throughout experimentation can lead to the development of an enhanced RAG system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch this workshop on YouTube
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A workshop version of this article is &lt;a href="https://www.youtube.com/watch?v=3MEMPZR1aZA"&gt;available on YouTube&lt;/a&gt;. Follow along using our &lt;a href="https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-quotient"&gt;GitHub notebook&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>rag</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Blog-Reading Chatbot with GPT-4o</title>
      <dc:creator>David Myriel</dc:creator>
      <pubDate>Tue, 14 May 2024 16:31:49 +0000</pubDate>
      <link>https://dev.to/qdrant/blog-reading-chatbot-with-gpt-4o-2kha</link>
      <guid>https://dev.to/qdrant/blog-reading-chatbot-with-gpt-4o-2kha</guid>
      <description>&lt;p&gt;In this tutorial, you will build a RAG system that combines blog content ingestion with the capabilities of semantic search. &lt;strong&gt;OpenAI’s GPT-4o LLM&lt;/strong&gt; is powerful, but scaling its use requires us to supply context systematically.&lt;/p&gt;

&lt;p&gt;RAG enhances the LLM’s generation of answers by retrieving relevant documents to aid the question-answering process. This setup showcases the integration of advanced search and AI language processing to improve information retrieval and generation tasks.&lt;/p&gt;

&lt;p&gt;A notebook for this tutorial is available on &lt;a href="https://github.com/qdrant/examples/blob/master/langchain-lcel-rag/Langchain-LCEL-RAG-Demo.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Privacy and Sovereignty:&lt;/strong&gt; RAG applications often rely on sensitive or proprietary internal data. Running the entire stack within your own environment becomes crucial for maintaining control over this data. Qdrant Hybrid Cloud deployed on &lt;a href="https://www.scaleway.com/en/" rel="noopener noreferrer"&gt;Scaleway&lt;/a&gt; addresses this need perfectly, offering a secure, scalable platform that still leverages the full potential of RAG. Scaleway offers serverless &lt;a href="https://www.scaleway.com/en/serverless-functions/" rel="noopener noreferrer"&gt;Functions&lt;/a&gt; and serverless &lt;a href="https://www.scaleway.com/en/serverless-jobs/" rel="noopener noreferrer"&gt;Jobs&lt;/a&gt;, both of which are ideal for embedding creation in large-scale RAG cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Components
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Host:&lt;/strong&gt; Scaleway on managed Kubernetes for compatibility with Qdrant Hybrid Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Database:&lt;/strong&gt; Qdrant Hybrid Cloud as the vector search engine for retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4o, developed by OpenAI is utilized as the generator for producing answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework:&lt;/strong&gt; &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; for extensive RAG capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7hhu9s71ye9gabatk6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7hhu9s71ye9gabatk6f.png" alt="Architecture diagram" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; Langchain &lt;a href="https://python.langchain.com/v0.1/docs/integrations/chat/" rel="noopener noreferrer"&gt;supports a wide range of LLMs&lt;/a&gt;, and GPT-4o is used as the main generator in this tutorial. You can easily swap it out for your preferred model that might be launched on your premises to complete the fully private setup. For the sake of simplicity, we used the OpenAI APIs, but LangChain makes the transition seamless.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying Qdrant Hybrid Cloud on Scaleway
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.scaleway.com/en/kubernetes-kapsule/" rel="noopener noreferrer"&gt;Scaleway Kapsule&lt;/a&gt; and &lt;a href="https://www.scaleway.com/en/kubernetes-kosmos/" rel="noopener noreferrer"&gt;Kosmos&lt;/a&gt; are managed Kubernetes services from Scaleway. They abstract away the complexities of managing and operating a Kubernetes cluster. &lt;/p&gt;

&lt;p&gt;The primary difference is that Kapsule clusters are composed solely of Scaleway Instances, whereas a Kosmos cluster is a managed multi-cloud Kubernetes engine that allows you to connect instances from any cloud provider to a single managed control plane.&lt;/p&gt;

&lt;p&gt;To start using managed Kubernetes on Scaleway, follow the &lt;a href="https://qdrant.tech/documentation/hybrid-cloud/platform-deployment-options/#scaleway" rel="noopener noreferrer"&gt;platform-specific documentation.&lt;/a&gt;&lt;br&gt;
Once your Kubernetes clusters are up, &lt;a href="https://qdrant.tech/documentation/hybrid-cloud/" rel="noopener noreferrer"&gt;you can begin deploying Qdrant Hybrid Cloud.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To prepare the environment for working with Qdrant and related libraries, it’s necessary to install all required Python packages. This can be done using Poetry, a tool for dependency management and packaging in Python. The code snippet imports various libraries essential for the tasks ahead, including bs4 for parsing HTML and XML documents, langchain and its community extensions for working with language models and document loaders, and Qdrant for vector storage and retrieval. These imports lay the groundwork for utilizing Qdrant alongside other tools for natural language processing and machine learning tasks.&lt;/p&gt;

&lt;p&gt;Qdrant will be running on a specific URL and access will be restricted by the API key. Make sure to store them both as environment variables as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export QDRANT_URL="https://qdrant.example.com"
export QDRANT_API_KEY="your-api-key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optional: Whenever you use LangChain, you can also &lt;a href="https://docs.smith.langchain.com/" rel="noopener noreferrer"&gt;configure LangSmith&lt;/a&gt;, which will help us trace, monitor and debug LangChain applications. You can sign up for LangSmith &lt;a href="https://smith.langchain.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-api-key"
export LANGCHAIN_PROJECT="your-project"  # if not specified, defaults to "default"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Now you can get started:

import getpass
import os

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_qdrant import Qdrant
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up the OpenAI API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;os.environ["OPENAI_API_KEY"] = getpass.getpass()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize the language model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
llm = ChatOpenAI(model="gpt-4o")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where we configure both the embeddings and the LLM. You can replace these with your own models using Ollama or other services; Scaleway’s L4 GPU Instances are a good fit for that compute.&lt;/p&gt;
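
&lt;p&gt;For example, here is a minimal sketch of swapping in self-hosted models served through Ollama; the model names are illustrative, and the imports assume the &lt;code&gt;langchain-community&lt;/code&gt; package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: a self-hosted alternative to the OpenAI models used in this tutorial
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

llm = ChatOllama(model="llama3", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;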

&lt;h2&gt;
  
  
  Download and parse data
&lt;/h2&gt;

&lt;p&gt;To begin working with blog post contents, we first load and parse the HTML. This is done with LangChain’s &lt;code&gt;WebBaseLoader&lt;/code&gt;, which relies on &lt;code&gt;urllib&lt;/code&gt; and &lt;code&gt;BeautifulSoup&lt;/code&gt; under the hood. After the content is loaded and parsed, it is indexed with Qdrant. The code snippet below loads the contents of a blog post by specifying its URL and the specific HTML elements to parse, preparing the data for chunking and indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Chunking data
&lt;/h3&gt;

&lt;p&gt;When dealing with large documents, such as a blog post exceeding 42,000 characters, it’s crucial to manage the data efficiently for processing. Many models have a limited context window and struggle with long inputs, making it difficult to extract or find relevant information. To overcome this, the document is divided into smaller chunks. This approach enhances the model’s ability to process and retrieve the most pertinent sections of the document effectively.&lt;/p&gt;

&lt;p&gt;In this scenario, the document is split into chunks using the &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; with a specified chunk size and overlap. This method ensures that no critical information is lost between chunks. Following the splitting, these chunks are then indexed into Qdrant—a vector database for efficient similarity search and storage of embeddings. The &lt;code&gt;Qdrant.from_documents&lt;/code&gt; function is utilized for indexing, with documents being the split chunks and embeddings generated through &lt;code&gt;OpenAIEmbeddings&lt;/code&gt;. The entire process is facilitated within an in-memory database, signifying that the operations are performed without the need for persistent storage, and the collection is named “lilianweng” for reference.&lt;/p&gt;

&lt;p&gt;This chunking and indexing strategy significantly improves the management and retrieval of information from large documents, making it a practical solution for handling extensive texts in data processing workflows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Qdrant.from_documents(
    documents=splits, embedding=OpenAIEmbeddings(), location=":memory:", collection_name="lilianweng"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Retrieve and generate content
&lt;/h2&gt;

&lt;p&gt;The vectorstore is used as a retriever to fetch relevant documents based on vector similarity. The &lt;code&gt;hub.pull("rlm/rag-prompt")&lt;/code&gt; call pulls a specific prompt from the LangChain Hub, designed to combine retrieved documents and a question into a response.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;format_docs&lt;/code&gt; function formats the retrieved documents into a single string, preparing them for further processing. This formatted string, along with a question, is passed through a chain of operations. First, the context (formatted documents) and the question are processed by the retriever and the prompt. Then, the result is fed into the large language model (&lt;code&gt;llm&lt;/code&gt;) for content generation. Finally, the output is parsed into a string format using &lt;code&gt;StrOutputParser()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This chain of operations demonstrates a sophisticated approach to information retrieval and content generation, leveraging both the semantic understanding capabilities of vector search and the generative prowess of large language models.&lt;/p&gt;

&lt;p&gt;Now, retrieve and generate data using relevant snippets from the blog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Invoking the RAG Chain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain.invoke("What is Task Decomposition?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
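
&lt;p&gt;Since the chain is a LangChain runnable, you can also stream the answer token by token instead of waiting for the full response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stream the generated answer chunk by chunk
for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;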



&lt;h2&gt;
  
  
  Next steps:
&lt;/h2&gt;

&lt;p&gt;We built a solid foundation for a simple chatbot, but there is still a lot to do. If you want to make the system production-ready, you should consider integrating the mechanism into your existing stack.&lt;/p&gt;

&lt;p&gt;Our vector database can easily be hosted on &lt;a href="https://www.scaleway.com/" rel="noopener noreferrer"&gt;Scaleway&lt;/a&gt;, our trusted &lt;a href="https://qdrant.tech/documentation/hybrid-cloud/" rel="noopener noreferrer"&gt;Qdrant Hybrid Cloud&lt;/a&gt; partner. This means that Qdrant can be run from your Scaleway region, but the database itself can still be managed from within Qdrant Cloud’s interface. Both products have been tested for compatibility and scalability, and we recommend their &lt;a href="https://www.scaleway.com/en/kubernetes-kapsule/" rel="noopener noreferrer"&gt;managed Kubernetes&lt;/a&gt; service. Their deployment regions in France are excellent for network latency and data sovereignty. For hosted GPUs, try &lt;a href="https://www.scaleway.com/en/l4-gpu-instance/" rel="noopener noreferrer"&gt;rendering with L4 GPU instances&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to ask in our Discord community.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is RAG (Retrieval-Augmented Generation)?</title>
      <dc:creator>Sabrina</dc:creator>
      <pubDate>Tue, 19 Mar 2024 16:47:57 +0000</pubDate>
      <link>https://dev.to/qdrant/what-is-rag-understanding-retrieval-augmented-generation-534n</link>
      <guid>https://dev.to/qdrant/what-is-rag-understanding-retrieval-augmented-generation-534n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Retrieval-augmented generation (RAG) integrates external information retrieval into the process of generating responses by Large Language Models (LLMs). It searches a database for information beyond its pre-trained knowledge base, significantly improving the accuracy and relevance of the generated responses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Language models have exploded on the internet ever since ChatGPT came out, and rightfully so. They can write essays, code entire programs, and even make memes (though we’re still deciding on whether that's a good thing).&lt;/p&gt;

&lt;p&gt;But as brilliant as these chatbots become, they still have &lt;strong&gt;limitations&lt;/strong&gt; in tasks requiring external knowledge and factual information. &lt;/p&gt;

&lt;p&gt;Yes, they can describe the honeybee's waggle dance in excruciating detail. But they become far more valuable if they can generate insights from &lt;strong&gt;any data&lt;/strong&gt; that we provide, rather than just their &lt;em&gt;original training data.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Since retraining those large language models from scratch costs millions of dollars and takes months, we need better ways to give our existing LLMs access to our custom data.&lt;/p&gt;

&lt;p&gt;While you could be more creative with your prompts, it is only a short-term solution. LLMs can consider only a &lt;strong&gt;limited&lt;/strong&gt; amount of text in their responses, known as a &lt;a href="https://www.hopsworks.ai/dictionary/context-window-for-llms"&gt;context window&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Some models like GPT-3 can see up to around 12 pages of text (that’s 4,096 tokens of context). That’s not good enough for most knowledge bases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo4w8jqvyqxduzq9fwi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo4w8jqvyqxduzq9fwi5.png" alt="How a RAG works" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows how a basic RAG system works. Before forwarding the question to the LLM, we have a layer that searches our knowledge base for the "relevant knowledge" to answer the user query. Specifically, in this case, the spending data from the last month. &lt;/p&gt;

&lt;p&gt;Our LLM can now generate a &lt;strong&gt;relevant non-hallucinated&lt;/strong&gt; response about our budget. &lt;/p&gt;

&lt;p&gt;As your data grows, you’ll need efficient ways to identify the most relevant information for your LLM's limited memory. This is where you’ll want a proper way to store and retrieve the specific data you’ll need for your query, without needing the LLM to remember it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector databases&lt;/strong&gt; store information as &lt;strong&gt;vector embeddings&lt;/strong&gt;. This format supports efficient similarity searches to retrieve relevant data for your query. For example, Qdrant is specifically designed to perform fast, even in scenarios dealing with billions of vectors.&lt;/p&gt;

&lt;p&gt;This article will focus on RAG systems and architecture. If you’re interested in learning more about vector search, we recommend the following articles: &lt;a href="https://qdrant.tech/articles/what-is-a-vector-database/"&gt;What is a Vector Database?&lt;/a&gt; and &lt;a href="https://qdrant.tech/articles/what-are-embeddings/"&gt;What are Vector Embeddings?&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG architecture
&lt;/h2&gt;

&lt;p&gt;At its core, a RAG architecture includes the &lt;strong&gt;retriever&lt;/strong&gt; and the &lt;strong&gt;generator&lt;/strong&gt;. Let's start by understanding what each of these components does.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Retriever
&lt;/h3&gt;

&lt;p&gt;When you ask the retriever a question, it uses &lt;strong&gt;similarity search&lt;/strong&gt; to scan through a vast knowledge base of vector embeddings. It then pulls out the most &lt;strong&gt;relevant&lt;/strong&gt; vectors to help answer that query. There are a few different techniques it can use to decide what’s relevant:&lt;/p&gt;

&lt;h3&gt;
  
  
  How indexing works in RAG retrievers
&lt;/h3&gt;

&lt;p&gt;The indexing process organizes the data into your vector database in a way that makes it easily searchable. This allows the RAG to access relevant information when responding to a query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w6dvfzo4s4dep9t8ryc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w6dvfzo4s4dep9t8ryc.png" alt="How indexing works" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the image above, here’s the process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a &lt;em&gt;loader&lt;/em&gt; that gathers &lt;em&gt;documents&lt;/em&gt; containing your data. These documents could be anything from articles and books to web pages and social media posts. &lt;/li&gt;
&lt;li&gt;Next, a &lt;em&gt;splitter&lt;/em&gt; divides the documents into smaller chunks, typically sentences or paragraphs, because RAG models work better with smaller pieces of text. In the diagram, these are &lt;em&gt;document snippets&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Each text chunk is then fed into an &lt;em&gt;embedding machine&lt;/em&gt;. This machine uses complex algorithms to convert the text into &lt;a href="https://qdrant.tech/articles/what-are-embeddings/"&gt;vector embeddings&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the generated vector embeddings are stored in a knowledge base of indexed information. This supports efficient retrieval of similar pieces of information when needed.&lt;/p&gt;
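
&lt;p&gt;To make this pipeline concrete, here is a minimal sketch in Python, assuming the &lt;code&gt;qdrant-client&lt;/code&gt; and &lt;code&gt;sentence-transformers&lt;/code&gt; packages; the collection name, model choice, and sample documents are illustrative, not prescriptive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# Illustrative documents; a real loader would pull these from files or a database.
documents = [
    "Qdrant is a vector database written in Rust.",
    "RAG combines retrieval with text generation.",
]

# The splitter step is elided here; assume each document is already a small chunk.
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings
vectors = model.encode(documents)

client = QdrantClient(":memory:")  # in-memory instance, handy for experimentation
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)
client.upsert(
    collection_name="knowledge_base",
    points=[
        models.PointStruct(id=i, vector=vec.tolist(), payload={"text": doc})
        for i, (doc, vec) in enumerate(zip(documents, vectors), start=1)
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;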

&lt;h3&gt;
  
  
  Query vectorization
&lt;/h3&gt;

&lt;p&gt;Once you have vectorized your knowledge base, you can do the same with the user query. When the model sees a new query, it uses the same preprocessing and embedding techniques. This ensures that the query vector is compatible with the document vectors in the index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgqcvfupjx6pvtc9nttz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgqcvfupjx6pvtc9nttz.png" alt="How retrieval works" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;
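
&lt;p&gt;Continuing the sketch above, the user query is embedded with the same model and searched against the collection (a hedged example; the &lt;code&gt;search&lt;/code&gt; call reflects one common qdrant-client pattern):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;query = "What does RAG combine?"
query_vector = model.encode(query)  # same embedding model as the documents

hits = client.search(
    collection_name="knowledge_base",
    query_vector=query_vector.tolist(),
    limit=2,  # number of nearest neighbors to return
)
for hit in hits:
    print(hit.score, hit.payload["text"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;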

&lt;h3&gt;
  
  
  Retrieval of relevant documents
&lt;/h3&gt;

&lt;p&gt;When the system needs to find the most relevant documents or passages to answer a query, it utilizes vector similarity techniques. &lt;strong&gt;Vector similarity&lt;/strong&gt; is a fundamental concept in machine learning and natural language processing (NLP) that quantifies the resemblance between vectors, which are mathematical representations of data points.&lt;/p&gt;

&lt;p&gt;The system can employ different vector similarity strategies depending on the type of vectors used to represent the data:&lt;/p&gt;

&lt;h4&gt;
  
  
  Sparse vector representations
&lt;/h4&gt;

&lt;p&gt;A sparse vector is characterized by a high dimensionality, with most of its elements being zero.&lt;/p&gt;

&lt;p&gt;The classic approach is &lt;strong&gt;keyword search&lt;/strong&gt;, which scans documents for the exact words or phrases in the query. The search creates sparse vector representations of documents by counting word occurrences and inversely weighting common words. Queries with rarer words get prioritized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez1mthlieogdyfv6lvlc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez1mthlieogdyfv6lvlc.png" alt="Sparse Vectors explanation" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf"&gt;TF-IDF&lt;/a&gt; (Term Frequency-Inverse Document Frequency) and &lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25"&gt;BM25&lt;/a&gt; are two classic related algorithms. They're simple and computationally efficient. However, they can struggle with synonyms and don't always capture semantic similarities.&lt;/p&gt;

&lt;p&gt;If you’re interested in going deeper, refer to our article on &lt;a href="https://qdrant.tech/articles/sparse-vectors/"&gt;Sparse Vectors&lt;/a&gt;.&lt;/p&gt;
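
&lt;p&gt;As a rough illustration of what a sparse representation looks like, here is a small sketch using scikit-learn's &lt;code&gt;TfidfVectorizer&lt;/code&gt; (an assumed dependency, used only for demonstration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "vector search with Qdrant",
    "keyword search ranks documents by rare terms",
    "sparse vectors are mostly zeros",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # one sparse row per document

# Most entries are zero: only the words present in a document get a weight,
# and rarer words receive higher (inverse document frequency) weights.
print(matrix.shape)   # (3, vocabulary_size)
print(matrix[0].nnz)  # number of non-zero entries in the first row
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;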

&lt;h4&gt;
  
  
  Dense vector embeddings
&lt;/h4&gt;

&lt;p&gt;This approach uses large language models like &lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)"&gt;BERT&lt;/a&gt; to encode the query and passages into dense vector embeddings. These embeddings are compact numerical representations that capture semantic meaning. Vector databases like Qdrant store these embeddings and retrieve them based on &lt;strong&gt;semantic similarity&lt;/strong&gt;, using distance metrics like cosine similarity rather than just keywords.&lt;/p&gt;

&lt;p&gt;This allows the retriever to match based on semantic understanding rather than just keywords. So if I ask about "compounds that cause BO," it can retrieve relevant info about "molecules that create body odor" even if those exact words weren't used. We explain more about it in our &lt;a href="https://qdrant.tech/articles/what-are-embeddings/"&gt;What are Vector Embeddings&lt;/a&gt; article.&lt;/p&gt;
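
&lt;p&gt;The body-odor example can be reproduced in a few lines; the encoder below is just one commonly used sentence-embedding model, assumed for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = model.encode("compounds that cause BO")
passage = model.encode("molecules that create body odor")

# Cosine similarity is high despite zero keyword overlap,
# because the two phrases are semantically close.
print(util.cos_sim(query, passage))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;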

&lt;h3&gt;
  
  
  Hybrid search
&lt;/h3&gt;

&lt;p&gt;However, neither keyword search nor vector search is always perfect. Keyword search may miss relevant information expressed differently, while vector search can sometimes struggle with specificity or neglect important statistical word patterns. Hybrid methods aim to combine the strengths of different techniques.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws5e60ew6dawaobvf3v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws5e60ew6dawaobvf3v6.png" alt="Hybrid search overview" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some common hybrid approaches include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using keyword search to get an initial set of candidate documents. Next, the documents are re-ranked/re-scored using semantic vector representations.&lt;/li&gt;
&lt;li&gt;Starting with semantic vectors to find generally topically relevant documents. Next, the documents are filtered/re-ranked based on keyword matches or other metadata.&lt;/li&gt;
&lt;li&gt;Considering both semantic vector closeness and statistical keyword patterns/weights in a combined scoring model.&lt;/li&gt;
&lt;li&gt;Using multiple stages where different techniques are applied. One example: start with an initial keyword retrieval, followed by semantic re-ranking, then a final re-ranking using even more complex models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you combine the powers of different search methods in a complementary way, you can provide higher quality, more comprehensive results. Check out our article on &lt;a href="https://qdrant.tech/articles/hybrid-search/"&gt;Hybrid Search&lt;/a&gt; if you’d like to learn more.&lt;/p&gt;
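
&lt;p&gt;As a sketch of the first pattern above (keyword candidates re-ranked semantically), something like the following would work; the &lt;code&gt;rank-bm25&lt;/code&gt; package and the model name are assumptions chosen for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Qdrant supports hybrid search with sparse and dense vectors.",
    "BM25 ranks documents using term frequency statistics.",
    "Dense embeddings capture semantic similarity between texts.",
]
query = "combining keyword and semantic retrieval"

# Stage 1: keyword retrieval with BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: re-rank the keyword candidates by dense cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode(query)
reranked = sorted(
    candidates,
    key=lambda i: float(util.cos_sim(query_vec, model.encode(corpus[i]))),
    reverse=True,
)
print([corpus[i] for i in reranked])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;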

&lt;h2&gt;
  
  
  The Generator
&lt;/h2&gt;

&lt;p&gt;With the top relevant passages retrieved, it's now the generator's job to produce a final answer by synthesizing and expressing that information in natural language. &lt;/p&gt;

&lt;p&gt;The LLM is typically a model like GPT, BART or T5, trained on massive datasets to understand and generate human-like text. It now takes not only the query (or question) as input but also the relevant documents or passages that the retriever identified as potentially containing the answer to generate its response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlq9rw4mw26gvf4qxl9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlq9rw4mw26gvf4qxl9r.png" alt="How a Generator works" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;
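
&lt;p&gt;In code, this step often amounts to packing the retrieved passages into the prompt. Below is a hedged sketch assuming the OpenAI Python client; any LLM client would work the same way, and the model name is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

retrieved_passages = [
    "You spent $1,420 on groceries last month.",
    "Your total spending in March was $3,975.",
]
question = "How much did I spend on groceries?"

# The generator sees both the question and the retrieved context.
prompt = "Answer using only the context below.\n\nContext:\n"
prompt += "\n".join(retrieved_passages)
prompt += f"\n\nQuestion: {question}"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;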

&lt;p&gt;The retriever and generator don't operate in isolation. The image below shows how the output of the retrieval feeds the generator to produce the final generated response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckvyszdo4cyklt09ts38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckvyszdo4cyklt09ts38.png" alt="The entire architecture of a RAG system" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is RAG being used?
&lt;/h2&gt;

&lt;p&gt;Because of their more knowledgeable and contextual responses, we can find RAG models being applied in many areas today, especially those that need factual accuracy and knowledge depth. &lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Applications:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question answering:&lt;/strong&gt; This is perhaps the most prominent use case for RAG models. They power advanced question-answering systems that can retrieve relevant information from large knowledge bases and then generate fluent answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language generation:&lt;/strong&gt; RAG enables more factual and contextualized text generation, such as summarization that draws on multiple sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-to-text generation:&lt;/strong&gt; By retrieving relevant structured data, RAG models can generate product or business intelligence reports from databases, or describe insights from data visualizations and charts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimedia understanding:&lt;/strong&gt; RAG isn't limited to text: it can retrieve multimodal information like images, video, and audio to enhance understanding, for example answering questions about images or videos by retrieving relevant textual context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating your first RAG chatbot with Langchain, Groq, and OpenAI
&lt;/h2&gt;

&lt;p&gt;Are you ready to create your own RAG chatbot from the ground up? We have a video explaining everything from the beginning. Daniel Romero will guide you through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up your chatbot&lt;/li&gt;
&lt;li&gt;Preprocessing and organizing data for your chatbot's use&lt;/li&gt;
&lt;li&gt;Applying vector similarity search algorithms&lt;/li&gt;
&lt;li&gt;Enhancing the efficiency and response quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After building your RAG chatbot, you'll be able to evaluate its performance against that of a chatbot powered solely by a Large Language Model (LLM).&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/O60-KuZZeQA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;Have a RAG project you want to bring to life? Join our &lt;a href="//discord.gg/qdrant"&gt;Discord community&lt;/a&gt; where we’re always sharing tips and answering questions on vector search and retrieval.&lt;/p&gt;

&lt;p&gt;Learn more about how to properly evaluate your RAG responses: &lt;a href="https://superlinked.com/vectorhub/evaluating-retrieval-augmented-generation-a-framework-for-assessment"&gt;Evaluating Retrieval Augmented Generation - a framework for assessment&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Qdrant 1.8.0 - Major Performance Enhancements</title>
      <dc:creator>David Myriel</dc:creator>
      <pubDate>Fri, 08 Mar 2024 16:41:12 +0000</pubDate>
      <link>https://dev.to/qdrant/qdrant-180-major-performance-enhancements-384o</link>
      <guid>https://dev.to/qdrant/qdrant-180-major-performance-enhancements-384o</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/qdrant/qdrant/releases/tag/v1.8.0" rel="noopener noreferrer"&gt;Qdrant 1.8.0 is out!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time around, we have focused on Qdrant's internals. Our goal was to optimize performance, so that your existing setup can run faster and save on compute. Here is what we've been up to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster sparse vectors:&lt;/strong&gt; Hybrid search is up to 16x faster now!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU resource management:&lt;/strong&gt; You can allocate CPU threads for faster indexing. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better indexing performance:&lt;/strong&gt; We optimized text indexing on the backend.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Faster search with sparse vectors
&lt;/h2&gt;

&lt;p&gt;Search throughput is now up to 16 times faster for sparse vectors. If you are &lt;a href="https://dev.to/articles/sparse-vectors/"&gt;using Qdrant for hybrid search&lt;/a&gt;, this means that you can now handle up to sixteen times as many queries. This improvement comes from extensive backend optimizations aimed at increasing efficiency and capacity. &lt;/p&gt;

&lt;p&gt;What this means for your setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query speed:&lt;/strong&gt; The time it takes to run a search query has been significantly reduced. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search capacity:&lt;/strong&gt; Qdrant can now handle a much larger volume of search requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User experience:&lt;/strong&gt; Results will appear faster, leading to a smoother experience for the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; You can easily accommodate rapidly growing user numbers or an expanding dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sparse vectors benchmark
&lt;/h3&gt;

&lt;p&gt;Performance results are publicly available for you to test. Qdrant's R&amp;amp;D developed a dedicated &lt;a href="https://github.com/qdrant/sparse-vectors-benchmark" rel="noopener noreferrer"&gt;open-source benchmarking tool&lt;/a&gt; just to test sparse vector performance.&lt;/p&gt;

&lt;p&gt;A real-life simulation of sparse vector queries was run against the &lt;a href="https://big-ann-benchmarks.com/neurips23.html" rel="noopener noreferrer"&gt;NeurIPS 2023 dataset&lt;/a&gt;. All tests were done on an 8 CPU machine on Azure. &lt;/p&gt;

&lt;p&gt;Latency (y-axis) has dropped significantly for queries. You can see the before/after here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kgj248oqlxs4i4wvvmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kgj248oqlxs4i4wvvmy.png" alt="dropping latency" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Dropping latency in sparse vector search queries across versions 1.7-1.8.&lt;/p&gt;

&lt;p&gt;The colors within both scatter plots show the frequency of results. The red dots show that the highest concentration is around 2200ms (before) and 135ms (after). This tells us that latency for sparse vector queries dropped by about a factor of 16. Therefore, the time it takes to retrieve an answer with Qdrant is that much shorter. &lt;/p&gt;

&lt;p&gt;This performance increase can have a dramatic effect on hybrid search implementations. &lt;a href="https://dev.to/articles/sparse-vectors/"&gt;Read more about how to set this up.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FYI, sparse vectors were released in &lt;a href="///articles/qdrant-1.7.x/#sparse-vectors"&gt;Qdrant v.1.7.0&lt;/a&gt;. They are stored using a different index, so first &lt;a href="https://dev.to/documentation/concepts/indexing/#sparse-vector-index"&gt;check out the documentation&lt;/a&gt; if you want to try an implementation.&lt;/p&gt;
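
&lt;p&gt;For a sense of what this looks like from the Python client, here is a hedged sketch of creating a collection with a sparse vector index and querying it; the exact call signatures may differ between client versions, so treat this as illustrative and check the documentation linked above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="hybrid_demo",
    vectors_config={},  # no dense vectors in this minimal example
    sparse_vectors_config={"text": models.SparseVectorParams()},
)

# A sparse vector stores only its non-zero dimensions: indices and values.
client.upsert(
    collection_name="hybrid_demo",
    points=[
        models.PointStruct(
            id=1,
            vector={"text": models.SparseVector(indices=[3, 17, 42], values=[0.6, 1.2, 0.3])},
            payload={"doc": "example document"},
        )
    ],
)

hits = client.search(
    collection_name="hybrid_demo",
    query_vector=models.NamedSparseVector(
        name="text",
        vector=models.SparseVector(indices=[17, 42], values=[1.0, 0.5]),
    ),
    limit=1,
)
print(hits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;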

&lt;h2&gt;
  
  
  CPU resource management
&lt;/h2&gt;

&lt;p&gt;Indexing is Qdrant’s most resource-intensive process. Now you can account for this by allocating compute specifically to indexing. You can assign a number of CPU threads to indexing and leave the rest for search. As a result, indexes will build faster, and search quality will remain unaffected.&lt;/p&gt;

&lt;p&gt;This isn't mandatory, as Qdrant is by default tuned to strike the right balance between indexing and search. However, if you wish to define specific CPU usage, you will need to do so from &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This version introduces an &lt;code&gt;optimizer_cpu_budget&lt;/code&gt; parameter to control the maximum number of CPUs used for indexing. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read more about &lt;code&gt;config.yaml&lt;/code&gt; in the &lt;a href="https://dev.to/documentation/guides/configuration/"&gt;configuration file&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CPU budget, how many CPUs (threads) to allocate for an optimization job.&lt;/span&gt;
&lt;span class="na"&gt;optimizer_cpu_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If left at 0, Qdrant will keep 1 or more CPUs unallocated - depending on CPU size.&lt;/li&gt;
&lt;li&gt;If the setting is positive, Qdrant will use this exact number of CPUs for indexing.&lt;/li&gt;
&lt;li&gt;If the setting is negative, Qdrant will subtract this number of CPUs from the available CPUs for indexing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most users, the default &lt;code&gt;optimizer_cpu_budget&lt;/code&gt; setting will work well. We only recommend you use this if your indexing load is significant.&lt;/p&gt;

&lt;p&gt;Our backend leverages dynamic CPU saturation to increase indexing speed. For that reason, the impact on search query performance ends up being minimal. Ultimately, you will be able to strike the best possible balance between indexing times and search performance.&lt;/p&gt;

&lt;p&gt;This configuration can be done at any time, but it requires a restart of Qdrant. Changing it affects both existing and new collections. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This feature is not configurable on &lt;a href="https://qdrant.to/cloud" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Better indexing for text data
&lt;/h2&gt;

&lt;p&gt;In order to minimize your RAM expenditure, we have developed a new way to index specific types of data. Please keep in mind that this is a backend improvement, and you won't need to configure anything. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Going forward, if you are indexing immutable text fields, we estimate a 10% reduction in RAM loads. Our benchmark result is based on a system that uses 64GB of RAM. If you are using less RAM, this reduction might be higher than 10%.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Immutable text fields are static and do not change once they are added to Qdrant. These entries usually represent some type of attribute, description, or tag. Vectors associated with them can be indexed more efficiently, since you don’t need to re-index them anymore. Conversely, mutable fields are dynamic and can be modified after their initial creation. Please keep in mind that they will continue to require additional RAM.&lt;/p&gt;

&lt;p&gt;This approach ensures stability in the vector search index, with faster and more consistent operations. We achieved this by setting up a field index which helps minimize what is stored. To improve search performance we have also optimized the way we load documents for searches with a text field index. Now our backend loads documents mostly sequentially and in increasing order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minor improvements and new features
&lt;/h2&gt;

&lt;p&gt;Beyond these enhancements, &lt;a href="https://github.com/qdrant/qdrant/releases/tag/v1.8.0" rel="noopener noreferrer"&gt;Qdrant v1.8.0&lt;/a&gt; adds and improves on several smaller features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Order points by payload:&lt;/strong&gt; In addition to searching for semantic results, you might want to retrieve results by specific metadata (such as price). You can now use Scroll API to &lt;a href="https://dev.to/documentation/concepts/points/#order-points-by-payload-key"&gt;order points by payload key&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datetime support:&lt;/strong&gt; We have implemented &lt;a href="https://dev.to/documentation/concepts/filtering/#datetime-range"&gt;datetime support for the payload index&lt;/a&gt;. Prior to this, if you wanted to search for a specific datetime range, you would have had to convert dates to UNIX timestamps. (&lt;a href="https://github.com/qdrant/qdrant/issues/3320" rel="noopener noreferrer"&gt;PR#3320&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check collection existence:&lt;/strong&gt; You can check whether a collection exists by appending &lt;code&gt;/exists&lt;/code&gt; to the &lt;code&gt;/collections/{collection_name}&lt;/code&gt; endpoint. You will get a true/false response (&lt;a href="https://github.com/qdrant/qdrant/pull/3472" rel="noopener noreferrer"&gt;PR#3472&lt;/a&gt;; see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find points&lt;/strong&gt; whose payloads match more than a minimum number of conditions: we added the &lt;code&gt;min_should&lt;/code&gt; option, which sets how many of the listed conditions must match for a point to be returned (&lt;a href="https://github.com/qdrant/qdrant/pull/3466/" rel="noopener noreferrer"&gt;PR#3331&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modify nested fields:&lt;/strong&gt; We have improved the &lt;code&gt;set_payload&lt;/code&gt; API, adding the ability to update nested fields (&lt;a href="https://github.com/qdrant/qdrant/pull/3548" rel="noopener noreferrer"&gt;PR#3548&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
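
&lt;p&gt;Here is a hedged sketch of a few of these features from the Python client, assuming a qdrant-client release matching Qdrant 1.8; exact signatures may vary between versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

# Check whether a collection exists (true/false, backed by the /exists endpoint).
print(client.collection_exists(collection_name="products"))

# Order scrolled points by a payload key instead of by id.
points, _next_page = client.scroll(
    collection_name="products",
    order_by=models.OrderBy(key="price", direction=models.Direction.DESC),
    limit=10,
)

# Filter on a datetime range directly, without converting to UNIX timestamps.
points, _next_page = client.scroll(
    collection_name="products",
    scroll_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="added_at",
                range=models.DatetimeRange(
                    gte=datetime(2024, 1, 1, tzinfo=timezone.utc),
                    lte=datetime(2024, 3, 1, tzinfo=timezone.utc),
                ),
            )
        ]
    ),
    limit=10,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;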

&lt;h2&gt;
  
  
  Release notes
&lt;/h2&gt;

&lt;p&gt;For more information, see &lt;a href="https://github.com/qdrant/qdrant/releases/tag/v1.8.0" rel="noopener noreferrer"&gt;our release notes&lt;/a&gt;. &lt;br&gt;
Qdrant is an open source project. We welcome your contributions; raise &lt;a href="https://github.com/qdrant/qdrant/issues" rel="noopener noreferrer"&gt;issues&lt;/a&gt;, or contribute via &lt;a href="https://github.com/qdrant/qdrant/pulls" rel="noopener noreferrer"&gt;pull requests&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>devops</category>
      <category>news</category>
      <category>database</category>
    </item>
    <item>
      <title>RAG is Dead. Long Live RAG!</title>
      <dc:creator>David Myriel</dc:creator>
      <pubDate>Wed, 28 Feb 2024 21:44:14 +0000</pubDate>
      <link>https://dev.to/qdrant/rag-is-dead-long-live-rag-5k8</link>
      <guid>https://dev.to/qdrant/rag-is-dead-long-live-rag-5k8</guid>
      <description>&lt;p&gt;When Anthropic came out with a context window of 100K tokens, they said: “&lt;em&gt;Vector search is dead. LLMs are getting more accurate and won’t need RAG anymore.&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;Google’s Gemini 1.5 now offers a context window of 10 million tokens. &lt;a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf" rel="noopener noreferrer"&gt;Their supporting paper&lt;/a&gt; claims victory over accuracy issues, even when applying Greg Kamradt’s &lt;a href="https://twitter.com/GregKamradt/status/1722386725635580292" rel="noopener noreferrer"&gt;NIAH methodology&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;It’s over. RAG must be completely obsolete now. Right?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Larger context windows are never the solution. Let me repeat. Never. They require more computational resources and lead to slower processing times. &lt;/p&gt;

&lt;p&gt;The community is already stress testing Gemini 1.5: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftait4tne3qsw6yl3hzch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftait4tne3qsw6yl3hzch.png" alt="rag-is-dead-1.png" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not surprising. LLMs require massive amounts of compute and memory to run. To cite Grant, running such a model by itself “would deplete a small coal mine to generate each completion”. Also, who is waiting 30 seconds for a response?&lt;/p&gt;

&lt;h2&gt;
  
  
  Context stuffing is not the solution
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Relying on context is expensive, and it doesn’t improve response quality in real-world applications. Retrieval based on vector search offers much higher precision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you solely rely on an LLM to perfect retrieval and precision, you are doing it wrong. &lt;/p&gt;

&lt;p&gt;A large context window makes it harder to focus on relevant information. This increases the risk of errors or hallucinations in its responses. &lt;/p&gt;

&lt;p&gt;Google found Gemini 1.5 significantly more accurate than GPT-4 at shorter context lengths and “a very small decrease in recall towards 1M tokens”. The recall is still below 0.8.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefu8rcww1dvr3hakh4pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefu8rcww1dvr3hakh4pm.png" alt="rag-is-dead-2" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We don’t think 60-80% is good enough. The LLM might retrieve enough relevant facts in its context window, but it still loses up to 40% of the available information.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The whole point of vector search is to circumvent this process by efficiently picking the information your app needs to generate the best response. A vector database keeps the compute load low and the query response fast. You don’t need to wait for the LLM at all.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Qdrant’s benchmark results are strongly in favor of accuracy and efficiency. We recommend that you consider them before deciding that an LLM is enough. Take a look at our &lt;a href="https://qdrant.tech/benchmarks/" rel="noopener noreferrer"&gt;open-source benchmark reports&lt;/a&gt; and &lt;a href="https://github.com/qdrant/vector-db-benchmark" rel="noopener noreferrer"&gt;try out the tests&lt;/a&gt; yourself. &lt;/p&gt;

&lt;h2&gt;
  
  
  Vector search in compound systems
&lt;/h2&gt;

&lt;p&gt;The future of AI lies in careful system engineering. As per &lt;a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/" rel="noopener noreferrer"&gt;Zaharia et al.&lt;/a&gt;, results from Databricks find that “60% of LLM applications use some form of RAG, while 30% use multi-step chains.” &lt;/p&gt;

&lt;p&gt;Even Gemini 1.5 demonstrates the need for a complex strategy. When looking at &lt;a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf" rel="noopener noreferrer"&gt;Google’s MMLU Benchmark&lt;/a&gt;, the model was called 32 times to reach a score of 90.0% accuracy. This shows us that even a basic compound arrangement is superior to monolithic models. &lt;/p&gt;

&lt;p&gt;As a retrieval system, a vector database perfectly fits the need for compound systems. Introducing one into your design opens up possibilities for superior applications of LLMs: faster, more accurate, and much cheaper to run. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key advantage of RAG is that it allows an LLM to pull in real-time information from up-to-date internal and external knowledge sources, making it more dynamic and adaptable to new information. - Oliver Molander, CEO of IMAGINAI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Qdrant scales to enterprise RAG scenarios
&lt;/h2&gt;

&lt;p&gt;People still don’t understand the economic benefit of vector databases. Why would a large corporate AI system need a stand-alone vector db like Qdrant? In our minds, this is the most important question. Let’s pretend that LLMs cease struggling with context thresholds altogether. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much would all of this cost?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;If you are running a RAG solution in an enterprise environment with petabytes of private data, your compute bill will be unimaginable. Let's assume 1 cent per 1K input tokens (which is the current GPT-4 Turbo pricing). Whatever you are doing, every time you go 100 thousand tokens deep, it will cost you $1. &lt;/p&gt;

&lt;p&gt;That’s a buck a question. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to our estimations, vector search queries are &lt;strong&gt;at least&lt;/strong&gt; 100 million times cheaper than queries made by LLMs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Conversely, the only up-front investment with vector databases is the indexing (which requires more compute). After this step, everything else is a breeze. Once set up, Qdrant easily scales via &lt;a href="https://qdrant.tech/articles/multitenancy/" rel="noopener noreferrer"&gt;features like Multitenancy and Sharding&lt;/a&gt;. This lets you scale up your reliance on the vector retrieval process and minimize your use of the compute-heavy LLMs. As an optimization measure, Qdrant is irreplaceable. &lt;/p&gt;

&lt;p&gt;Julien Simon from HuggingFace says it best:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG is not a workaround for limited context size. For mission-critical enterprise use cases, RAG is a way to leverage high-value, proprietary company knowledge that will never be found in public datasets used for LLM training. At the moment, the best place to index and query this knowledge is some sort of vector index. In addition, RAG downgrades the LLM to a writing assistant. Since built-in knowledge becomes much less important, a nice small 7B open-source model usually does the trick at a fraction of the cost of a huge generic model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Long Live RAG
&lt;/h2&gt;

&lt;p&gt;As LLMs continue to require enormous computing power, users will need to leverage vector search and RAG.&lt;/p&gt;

&lt;p&gt;Our customers remind us of this fact every day. As a product, our vector database is highly scalable and business-friendly. We develop our features strategically to follow our company’s Unix philosophy. &lt;/p&gt;

&lt;p&gt;We want to keep Qdrant compact, efficient and with a focused purpose. This purpose is to empower our customers to use it however they see fit. &lt;/p&gt;

&lt;p&gt;When large enterprises release their generative AI into production, they need to keep costs under control, while retaining the best possible quality of responses. Qdrant has the tools to do just that. &lt;/p&gt;

&lt;p&gt;Whether through &lt;a href="https://qdrant.tech/articles/vector-similarity-beyond-search/" rel="noopener noreferrer"&gt;RAG, Semantic Search, Dissimilarity Search, Recommendations or Multimodality&lt;/a&gt; - Qdrant will continue to journey on.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Enhance OpenAI Embeddings with Qdrant's Binary Quantization</title>
      <dc:creator>Nirant</dc:creator>
      <pubDate>Fri, 23 Feb 2024 17:46:21 +0000</pubDate>
      <link>https://dev.to/qdrant/enhance-openai-embeddings-with-qdrants-binary-quantization-3bcd</link>
      <guid>https://dev.to/qdrant/enhance-openai-embeddings-with-qdrants-binary-quantization-3bcd</guid>
      <description>&lt;p&gt;OpenAI Ada-003 embeddings are a powerful tool for natural language processing (NLP). However, the size of the embeddings are a challenge, especially with real-time search and retrieval. In this article, we explore how you can use Qdrant's Binary Quantization to enhance the performance and efficiency of OpenAI embeddings.&lt;/p&gt;

&lt;p&gt;In this post, we discuss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The significance of OpenAI embeddings and real-world challenges. &lt;/li&gt;
&lt;li&gt;Qdrant's Binary Quantization, and how it can improve the performance of OpenAI embeddings&lt;/li&gt;
&lt;li&gt;Results of an experiment that highlights improvements in search efficiency and accuracy&lt;/li&gt;
&lt;li&gt;Implications of these findings for real-world applications&lt;/li&gt;
&lt;li&gt;Best practices for leveraging Binary Quantization to enhance OpenAI embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also try out these techniques as described in &lt;a href="https://github.com/qdrant/examples/blob/openai-3/binary-quantization-openai/README.md"&gt;Binary Quantization OpenAI&lt;/a&gt;, which includes Jupyter notebooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  New OpenAI Embeddings: Performance and Changes
&lt;/h2&gt;

&lt;p&gt;As the technology of embedding models has advanced, demand has grown. Users are looking more for powerful and efficient text-embedding models. OpenAI's Ada-003 embeddings offer state-of-the-art performance on a wide range of NLP tasks, including those noted in &lt;a href="https://huggingface.co/spaces/mteb/leaderboard"&gt;MTEB&lt;/a&gt; and &lt;a href="https://openai.com/blog/new-embedding-models-and-api-updates"&gt;MIRACL&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;These models include multilingual support in over 100 languages. The transition from text-embedding-ada-002 to text-embedding-3-large has led to a significant jump in performance scores (from 31.4% to 54.9% on MIRACL).&lt;/p&gt;

&lt;h4&gt;
  
  
  Matryoshka Representation Learning
&lt;/h4&gt;

&lt;p&gt;The new OpenAI models have been trained with a novel approach called "&lt;a href="https://aniketrege.github.io/blog/2024/mrl/"&gt;Matryoshka Representation Learning&lt;/a&gt;". Developers can set up embeddings of different sizes (number of dimensions). In this post, we use the small and large variants. Developers can select embeddings which balance accuracy and size.&lt;/p&gt;

&lt;p&gt;Here, we show how the accuracy of binary quantization holds up quite well across different dimensions, for both models. &lt;/p&gt;

&lt;h2&gt;
  
  
  Enhanced Performance and Efficiency with Binary Quantization
&lt;/h2&gt;

&lt;p&gt;By reducing storage needs, you can scale applications with lower costs. This addresses a critical challenge posed by the original embedding sizes. Binary Quantization also speeds the search process. It simplifies the complex distance calculations between vectors into more manageable bitwise operations, which supports potentially real-time searches across vast datasets. &lt;/p&gt;

&lt;p&gt;The accompanying graph illustrates the promising accuracy levels achievable with binary quantization across different model sizes, showcasing its practicality without severely compromising on performance. This dual advantage of storage reduction and accelerated search capabilities underscores the transformative potential of Binary Quantization in deploying OpenAI embeddings more effectively across various real-world applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ltutmrqrcwxq2x8bc25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ltutmrqrcwxq2x8bc25.png" alt="Image description" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The efficiency gains from Binary Quantization are as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced storage footprint: It helps with large-scale datasets. It also saves on memory, and scales up to 30x at the same cost. &lt;/li&gt;
&lt;li&gt;Enhanced speed of data retrieval: Smaller data sizes generally lead to faster searches. &lt;/li&gt;
&lt;li&gt;Accelerated search process: Binary Quantization simplifies the distance calculations between vectors into bitwise operations. This enables real-time querying even in extensive databases.&lt;/li&gt;
&lt;/ul&gt;
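
&lt;p&gt;For reference, enabling binary quantization when creating a collection looks roughly like this in the Python client (a sketch; the collection name and the &lt;code&gt;always_ram&lt;/code&gt; choice are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")
client.create_collection(
    collection_name="openai_embeddings",
    vectors_config=models.VectorParams(
        size=3072,  # e.g. text-embedding-3-large at full dimensionality
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,  # keep the compact binary index in RAM
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;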

&lt;h3&gt;
  
  
  Experiment Setup: OpenAI Embeddings in Focus
&lt;/h3&gt;

&lt;p&gt;To identify Binary Quantization's impact on search efficiency and accuracy, we designed our experiment on OpenAI text-embedding models. These models, which capture nuanced linguistic features and semantic relationships, are the backbone of our analysis. We then delve deep into the potential enhancements offered by Qdrant's Binary Quantization feature.&lt;/p&gt;

&lt;p&gt;This approach not only leverages the high-caliber OpenAI embeddings but also provides a broad basis for evaluating the search mechanism under scrutiny.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dataset
&lt;/h4&gt;

&lt;p&gt;The research employs 100K random samples from the &lt;a href="https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M"&gt;OpenAI 1M&lt;/a&gt; dataset, focusing on 100 randomly selected records. These records serve as queries in the experiment, aiming to assess how Binary Quantization influences search efficiency and precision within the dataset. We then use the embeddings of the queries to search for the nearest neighbors in the dataset. &lt;/p&gt;

&lt;h4&gt;
  
  
  Parameters: Oversampling, Rescoring, and Search Limits
&lt;/h4&gt;

&lt;p&gt;For each record, we run a parameter sweep over the number of oversampling, rescoring, and search limits. We can then understand the impact of these parameters on search accuracy and efficiency. Our experiment was designed to assess the impact of Binary Quantization under various conditions, based on the following parameters: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Oversampling&lt;/strong&gt;: By oversampling, we can limit the loss of information inherent in quantization. This also helps to preserve the semantic richness of your OpenAI embeddings. We experimented with different oversampling factors, and identified the impact on the accuracy and efficiency of search. Spoiler: higher oversampling factors tend to improve the accuracy of searches. However, they usually require more computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rescoring&lt;/strong&gt;: Rescoring refines the first results of an initial binary search. This process leverages the original high-dimensional vectors to refine the search results, &lt;strong&gt;always&lt;/strong&gt; improving accuracy. We toggled rescoring on and off to measure effectiveness, when combined with Binary Quantization. We also measured the impact on search performance. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search Limits&lt;/strong&gt;: We specify the number of results returned by the search process. We experimented with various search limits to measure their impact on accuracy and efficiency, exploring the trade-offs between search depth and performance. The results provide insight for applications with different precision and speed requirements (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
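
&lt;p&gt;In the Python client, the oversampling, rescoring, and limit parameters map onto the search call roughly as follows; this is a sketch, and &lt;code&gt;query_embedding&lt;/code&gt; is assumed to be an OpenAI embedding produced elsewhere:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")
hits = client.search(
    collection_name="openai_embeddings",
    query_vector=query_embedding,  # assumed: an embedding produced elsewhere
    limit=50,  # search limit: number of results requested
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # use the binary index
            rescore=True,      # rescore candidates with the original vectors
            oversampling=3.0,  # retrieve limit * 3 binary candidates first
        )
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;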

&lt;p&gt;Through this detailed setup, our experiment sought to shed light on the nuanced interplay between Binary Quantization and the high-quality embeddings produced by OpenAI's models. By meticulously adjusting and observing the outcomes under different conditions, we aimed to uncover actionable insights that could empower users to harness the full potential of Qdrant in combination with OpenAI's embeddings, regardless of their specific application needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results: Binary Quantization's Impact on OpenAI Embeddings
&lt;/h3&gt;

&lt;p&gt;To analyze the impact of rescoring (&lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;), we compared results across different model configurations and search limits. Rescoring sets up a more precise search, based on results from an initial query.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rescoring
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov6p6n1csk72ndjqlg9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov6p6n1csk72ndjqlg9a.png" alt="Image description" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some key observations on the impact of rescoring (&lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Significantly Improved Accuracy&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Across all models and dimension configurations, enabling rescoring (&lt;code&gt;True&lt;/code&gt;) consistently results in higher accuracy scores compared to when rescoring is disabled (&lt;code&gt;False&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The improvement in accuracy is true across various search limits (10, 20, 50, 100).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model and Dimension Specific Observations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;code&gt;text-embedding-3-large&lt;/code&gt; model with 3072 dimensions, rescoring boosts the accuracy from an average of about 76-77% without rescoring to 97-99% with rescoring, depending on the search limit and oversampling rate.&lt;/li&gt;
&lt;li&gt;The accuracy improvement with increased oversampling is more pronounced when rescoring is enabled, indicating a better utilization of the additional binary codes in refining search results.&lt;/li&gt;
&lt;li&gt;With the &lt;code&gt;text-embedding-3-small&lt;/code&gt; model at 512 dimensions, accuracy increases from around 53-55% without rescoring to 71-91% with rescoring, highlighting the significant impact of rescoring, especially at lower dimensions.&lt;/li&gt;
&lt;li&gt;For higher dimension models (such as text-embedding-3-large with 3072 dimensions), increasing the oversampling level continues to yield noticeable accuracy gains. In contrast, for lower dimension models (such as text-embedding-3-small with 512 dimensions), the incremental accuracy gains from increased oversampling levels are less significant, even with rescoring enabled. This suggests a diminishing return on accuracy improvement with higher oversampling in lower dimension spaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Influence of Search Limit&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The performance gain from rescoring seems to be relatively stable across different search limits, suggesting that rescoring consistently enhances accuracy regardless of the number of top results considered.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In summary, enabling rescoring dramatically improves search accuracy across all tested configurations. It is a crucial feature for applications where precision is paramount. The consistent performance boost provided by rescoring underscores its value in refining search results, particularly when working with complex, high-dimensional data like OpenAI embeddings. This enhancement is critical for applications that demand high accuracy, such as semantic search, content discovery, and recommendation systems, where the quality of search results directly impacts user experience and satisfaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset Combinations
&lt;/h3&gt;

&lt;p&gt;For those exploring the integration of text embedding models with Qdrant, it's crucial to consider various model configurations for optimal performance. The dataset combinations defined below illustrate different configurations to test against Qdrant. These combinations vary by two primary attributes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Name&lt;/strong&gt;: Signifying the specific text embedding model variant, such as "text-embedding-3-large" or "text-embedding-3-small". This distinction correlates with the model's capacity, with "large" models offering more detailed embeddings at the cost of increased computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimensions&lt;/strong&gt;: This refers to the size of the vector embeddings produced by the model. Options range from 512 to 3072 dimensions. Higher dimensions could lead to more precise embeddings but might also increase the search time and memory usage in Qdrant.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Optimizing these parameters is a balancing act between search accuracy and resource efficiency. Testing across these combinations allows users to identify the configuration that best meets their specific needs, considering the trade-offs between computational resources and the quality of search results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_combinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Exploring Dataset Combinations and Their Impacts on Model Performance
&lt;/h4&gt;

&lt;p&gt;The code snippet iterates through predefined dataset and model combinations. For each combination, characterized by the model name and its dimensions, the corresponding experiment's results are loaded. These results, which are stored in JSON format, include performance metrics like accuracy under different configurations: with and without oversampling, and with and without a rescore step.&lt;/p&gt;

&lt;p&gt;Following the extraction of these metrics, the code computes the average accuracy across different settings, excluding extreme cases of very low limits (specifically, limits of 1 and 5). This computation groups the results by oversampling, rescore presence, and limit, before calculating the mean accuracy for each subgroup.&lt;/p&gt;

&lt;p&gt;After gathering and processing this data, the average accuracies are organized into a pivot table. This table is indexed by the limit (the number of top results considered), and columns are formed based on combinations of oversampling and rescoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;combination&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset_combinations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combination&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;dimensions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combination&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, dimensions: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../results/results-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oversampling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rescore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;average_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_accuracy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oversampling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rescore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Impact of Oversampling
&lt;/h4&gt;

&lt;p&gt;In quantized search, oversampling controls how many candidates are retrieved with the compressed binary vectors before the final ranking. With an oversampling factor of 3 and a limit of 100, Qdrant fetches 300 candidates using the fast binary index. Combined with rescoring, which re-ranks those candidates against the original full-precision vectors, the larger candidate pool makes it far more likely that the true top results survive the lossy first pass, recovering most of the accuracy lost to quantization.&lt;/p&gt;
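
&lt;p&gt;Both knobs are applied at query time. Here is a minimal sketch of such a query, assuming a recent &lt;code&gt;qdrant-client&lt;/code&gt; and an illustrative collection name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

query_vector = [0.0] * 3072  # placeholder: a real OpenAI embedding goes here

# Fetch limit * oversampling candidates with the binary index,
# then re-rank that pool against the original vectors (rescore=True).
response = client.query_points(
    collection_name="openai-embeddings",  # hypothetical collection name
    query=query_vector,
    limit=100,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,     # use the quantized index for the first pass
            rescore=True,
            oversampling=3.0,
        )
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The effective candidate pool here is 300 points, from which the best 100 are returned after rescoring.&lt;/p&gt;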

&lt;p&gt;The chart below shows how average accuracy responds to the oversampling factor. Increasing the factor widens the candidate pool available for rescoring, so accuracy improves, with diminishing returns as the factor grows. That is the trade-off to tune: every increment adds rescoring work in exchange for recovering more of the accuracy lost to binarization.&lt;/p&gt;

&lt;p&gt;Reading the results as a before-and-after comparison, with rescoring switched on and off at each oversampling level, makes the contribution of each parameter visible on its own and shows why the two are tuned together rather than in isolation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsircyes2fei0l60gp91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsircyes2fei0l60gp91.png" alt="Image description" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Leveraging Binary Quantization: Best Practices
&lt;/h3&gt;

&lt;p&gt;We recommend the following best practices for leveraging Binary Quantization to enhance OpenAI embeddings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding Model: Use OpenAI's text-embedding-3-large. It was the most accurate among the models we tested.&lt;/li&gt;
&lt;li&gt;Dimensions: Use the highest number of dimensions available for the model to maximize accuracy. This held for English as well as for other languages in our tests.&lt;/li&gt;
&lt;li&gt;Oversampling: Use an oversampling factor of 3 for the best balance between accuracy and efficiency. This factor is suitable for a wide range of applications.&lt;/li&gt;
&lt;li&gt;Rescoring: Enable rescoring to improve the accuracy of search results.&lt;/li&gt;
&lt;li&gt;RAM: Store the full vectors and payload on disk and keep only the binary quantization index in RAM. This reduces the memory footprint without hurting end-to-end latency: the extra disk read during rescoring is negligible compared to the savings from binary scoring, which Qdrant accelerates with SIMD instructions where possible. See the configuration sketch after this list.&lt;/li&gt;
&lt;/ol&gt;
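
&lt;p&gt;Items 1, 2, and 5 come together at collection-creation time. The following sketch assumes the same hypothetical collection name as above and the 3072-dimensional vectors of text-embedding-3-large:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

client.create_collection(
    collection_name="openai-embeddings",  # hypothetical collection name
    vectors_config=models.VectorParams(
        size=3072,                        # text-embedding-3-large dimensionality
        distance=models.Distance.COSINE,
        on_disk=True,                     # full-precision vectors live on disk
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,              # only the compact binary index stays in RAM
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this layout, the binary index handles the fast first pass in memory, while the on-disk originals are touched only during rescoring.&lt;/p&gt;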

&lt;p&gt;Want to discuss these findings and learn more about Binary Quantization? &lt;a href="https://discord.gg/qdrant"&gt;Join our Discord community.&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Learn more about how to boost your vector search speed and accuracy while reducing costs: &lt;a href="https://qdrant.tech/documentation/guides/quantization/?selector=aHRtbCA%2BIGJvZHkgPiBkaXY6bnRoLW9mLXR5cGUoMSkgPiBzZWN0aW9uID4gZGl2ID4gZGl2ID4gZGl2Om50aC1vZi10eXBlKDIpID4gYXJ0aWNsZSA%2BIGgyOm50aC1vZi10eXBlKDIp"&gt;Binary Quantization.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
