DEV Community: Jianyang Gao

TurboQuant and RaBitQ: What the Public Story Gets Wrong

Jianyang Gao — Tue, 31 Mar 2026 15:07:18 +0000

Hi everyone. My name is Jianyang Gao. I am currently a postdoctoral researcher at ETH Zurich, and I am the first author of the RaBitQ line of work.

In Google Research's paper accepted to ICLR 2026 in January 2026, "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate," there are serious problems in its description of the prior RaBitQ vector quantization method, its comparison of theoretical results, and its experimental comparison with RaBitQ. I will explain the details below. We explicitly pointed out these problems by email before the TurboQuant paper was submitted to ICLR 2026. The TurboQuant team explicitly acknowledged that they were aware of them, but chose not to correct them. The paper was then accepted to ICLR 2026 and subsequently promoted at large scale through Google's official channels, reaching tens of millions of views on social media.

We are speaking publicly now because once an inaccurate academic narrative spreads widely, the cost of correcting it only becomes higher.

Background: What is RaBitQ?

The RaBitQ papers listed below are the main outcome of my PhD research at Nanyang Technological University (NTU Singapore), under the supervision of Associate Professor Cheng Long. The work was published in 2024. It proposed a high-dimensional vector quantization method and theoretically proved that it achieves the asymptotically optimal error bound established in a top theoretical computer science paper (Alon-Klartag, FOCS 2017).

RaBitQ (arXiv:2405.12497, May 2024, later published at SIGMOD 2024)
Extended version (arXiv:2409.09913, September 2024, later published at SIGMOD 2025)

One of the key ideas of RaBitQ is to apply a random rotation to the input vector before quantization, that is, a random rotation / Johnson-Lindenstrauss transform. RaBitQ utilizes the properties of random rotation to perform vector quantization, and it achieves the optimal theoretical error bound.

Problem 1 in TurboQuant: Systematically avoiding the methodological similarity between TurboQuant and the prior RaBitQ method

RaBitQ and TurboQuant have a direct structural relationship at the method level. Both apply a random rotation (a Johnson-Lindenstrauss transform) to the input vector before quantization. This is the most central and closest overlap in the design of the two methods.

In their reply to a reviewer on the ICLR OpenReview platform, the TurboQuant authors described their own method as follows:

"We achieve this by first normalizing the vectors by their l2 norm and then applying a random rotation to ensure the entries of the vectors will have a beta distribution post rotation."

However, neither in that response, nor in the method description in the TurboQuant paper, nor anywhere else in the paper, do they directly state that this structure is the same as the one used in RaBitQ. This omission occurred in the following context:

In January 2025, several months before the TurboQuant paper appeared on arXiv, the second author of TurboQuant, Majid Daliri, proactively contacted us and asked for help debugging his own Python version translated from our RaBitQ C++ implementation. He described in detail the steps he had taken, the code snippets he used, and the specific errors he encountered. This shows that the TurboQuant team had a detailed understanding of the technical details of RaBitQ. Yet in the arXiv version they released in April 2025, and again in the version they submitted to ICLR 2026 in September 2025, they described RaBitQ as grid-based PQ while omitting the core random rotation step in RaBitQ. An ICLR reviewer independently pointed this out in the review, writing: "RaBitQ and variants are similar to TurboQuant in that they all use random projection," and explicitly requested a fuller discussion and comparison. Even so, in the final ICLR version, the TurboQuant authors not only failed to add any real discussion of RaBitQ, but actually moved their already incomplete description of RaBitQ out of the main text and into the appendix.

Because of this, in March 2026 we emailed all TurboQuant authors and raised the issue again, together with a request for correction. In response, the TurboQuant authors refused this request on the grounds that:

"The use of random rotation and Johnson-Lindenstrauss transformations has become a standard technique in the field, and it is not feasible for us to cite every method that employs them."

We believe this response deflects the real issue. RaBitQ is not just one of many unrelated methods using a generic idea. Under the same problem setting, it is the concrete prior work that first combined random rotations (Johnson-Lindenstrauss transforms) with vector quantization and established optimal theoretical guarantees. RaBitQ should therefore be described accurately in the paper, and its relationship to TurboQuant should be discussed explicitly.

Problem 2 in TurboQuant: Mischaracterizing RaBitQ's theoretical results

Without providing any supporting argument, the TurboQuant paper characterizes RaBitQ's theoretical guarantees as "suboptimal." The paper states:

"While the paper's theoretical guarantees are suboptimal, likely due to loose analysis -- as practical performance surpasses theoretical bounds"

This sentence directly labels RaBitQ's theoretical guarantees as "suboptimal" and attributes that to "loose analysis." But the paper provides no derivation, comparison, or evidence to justify this claim.

The fact is that in Theorem 3.2 of the extended RaBitQ paper (arXiv:2409.09913), we already gave a rigorous proof that RaBitQ achieves the asymptotically optimal error bound established in the top theoretical computer science paper of Alon and Klartag (FOCS 2017). Because of this result, we were invited to present it at a workshop affiliated with FOCS, one of the top conferences in theoretical computer science.

For this reason, in May 2025 we had multiple rounds of detailed technical email exchanges with the second author of TurboQuant, Majid Daliri, and clarified point by point where the TurboQuant team's reading of our theoretical result was wrong. In those emails, Majid Daliri explicitly stated that he had communicated these discussions to all co-authors.

However, throughout the later process in which TurboQuant was submitted to ICLR 2026, reviewed, accepted, and then broadly promoted, this incorrect characterization of RaBitQ's theoretical guarantee was never corrected.

An unsupported claim that remained in the formally published TurboQuant paper even after the original authors pointed out the error in detail, and even after the TurboQuant team explicitly knew about it, goes beyond the category of an ordinary mistake.

Problem 3 in TurboQuant: Deliberately creating an unfair experimental setup

The TurboQuant paper tested RaBitQ using a degraded implementation and a single-core CPU with multithreading disabled, while testing TurboQuant on an A100 GPU. The quantization speed reported for RaBitQ in TurboQuant is several orders of magnitude slower than the actual speed of our open-source implementation.

In an email from May 2025, Majid Daliri himself explained where this gap came from:

"we were using a single-core CPU instance, and multiprocessing was indeed disabled [...] we weren't fully utilizing parallelism, which explains why it was significantly slower"

Our official RaBitQ code was already publicly available when each RaBitQ paper first appeared on arXiv (May 2024 and September 2024), and it used multithreaded parallelism by default. Moreover, in his January 2025 emails, Majid Daliri also stated that he had successfully run RaBitQ for testing, but the version he used for the experiments was still his own translated Python implementation. This means that the speed numbers reported for RaBitQ in the TurboQuant paper were built on top of two systematic sources of unfairness:

They used their own translated Python code instead of our open-source C++ implementation.
They evaluated RaBitQ on a single-core CPU with multithreading disabled, while evaluating TurboQuant on an NVIDIA A100 GPU.

Neither of these two points was fully disclosed in the paper. What readers see is the conclusion that RaBitQ is slower than TurboQuant by several orders of magnitude. What they are not told is that this conclusion is built on deliberately constructed unfair experimental conditions.

Full timeline of events

May 2024: The RaBitQ paper was posted on arXiv, with source code released at the same time. It was later published at SIGMOD 2024.
Sep. 2024: The extended RaBitQ paper was posted on arXiv, with source code released at the same time. It was later published at SIGMOD 2025.
Jan. 2025: TurboQuant second author Majid Daliri contacted us and asked for help debugging a Python implementation of RaBitQ.
Apr. 2025: The TurboQuant paper was posted on arXiv.
May 2025: We emailed Majid Daliri about the differences in experimental setup and clearly explained why RaBitQ's theoretical guarantees are optimal. Majid Daliri said he had informed all authors, but after we asked them to correct the factual errors in TurboQuant, he stopped replying.
Nov. 2025: We discovered that the TurboQuant paper had been submitted to ICLR 2026 and that the factual errors in the paper still had not been corrected. We therefore contacted the ICLR 2026 PC Chairs, but received no response.
Jan. 2026: The TurboQuant paper was accepted to ICLR 2026.
Mar. 2026: The TurboQuant team continued to promote the paper through Google's official channels, and related social media views reached tens of millions.
Mar. 2026: We formally emailed all TurboQuant authors, explained the three factual problems above, and requested corrections and clarifications. To date, we have only received a generic response from the TurboQuant first author, Amir Zandieh, who promised to address problems 2 and 3 but refused to address problem 1, namely the need to discuss the technical similarity between TurboQuant and RaBitQ. In addition, they were only willing to make any such corrections after the official ICLR 2026 conference had concluded.

What we have already done

Posted a public comment on ICLR OpenReview: https://openreview.net/forum?id=tO3ASKZlok
Submitted a formal complaint again to the ICLR General Chairs, PC Chairs, and Code and Ethics Chairs, together with a full evidence package

What we will do next

Release a detailed technical report on TurboQuant and RaBitQ on arXiv
Consider raising the matter further with relevant institutions

Final remarks

Our goal in raising these issues is to ensure that the public academic record accurately reflects the real relationship among these methods. Once a paper is pushed to the public by Google with tens of millions of impressions, the inaccurate narrative in that paper does not need to be actively propagated. If it is left uncorrected, it will become consensus by default. That is why we chose to document this publicly.

We also sincerely ask everyone to help more people understand the problems behind the TurboQuant paper. We believe that the truth becomes clearer through open debate.

Extended RaBitQ: an Optimized Scalar Quantization Method

Jianyang Gao — Thu, 12 Dec 2024 15:18:54 +0000

The cover is generated by DALL-E.

Approximate nearest neighbor (ANN) search in high-dimensional spaces, also known as vector search, is an increasingly important area due to its surging applications in vector databases, large language models (LLMs), and retrieval-augmented generation (RAG). However, to support various applications with low latency, a system often needs to host millions or even billions of vectors in RAM, leading to high costs, for instance, when a service is deployed in the cloud. Quantization, a family of techniques for vector compression, therefore plays an increasingly critical role in reducing the space consumption and cost of such services.

In our previous post, we illustrated the key insights behind our research, the RaBitQ algorithm, which works as an optimized approach to binary quantization. It achieves a significant improvement in accuracy in practice and provides an asymptotically optimal theoretical error bound. However, the original RaBitQ paper only supports binary quantization (1-bit quantization). It remained unclear how it could utilize more bits (e.g., 2, 3, or 4 bits per dimension) to achieve higher accuracy. In this post, we introduce our extended version of RaBitQ, which presents a new strategy for minimizing the error when more bits are used. This allows the RaBitQ method to support quantization at an arbitrary compression rate. In particular, it provides an optimized approach to scalar quantization which has the potential to replace the existing ones in a system seamlessly:

Its accuracy is consistently better than that of the state-of-the-art variant of scalar quantization under the same compression rates.
Its computation during querying is exactly the same as that of scalar quantization.

The paper and code were released about three months ago, in September 2024. To better understand this post, we suggest first reading our previous post.

Codebook: From 1-bit per dimension to multiple-bit per dimension

Recall that in the RaBitQ method, we construct a quantization codebook by taking the vertices of a cube nested in the unit sphere, which is illustrated as follows.

Let $D$ be the dimensionality of a vector. Each vertex of the cube corresponds to a bi-valued vector, i.e., all the coordinates of the vector are either $- \frac{1}{\sqrt{D}}$ or $+ \frac{1}{\sqrt{D}}$ , so that every vertex has unit length. Thus, the codebook in total contains $2^{D}$ vectors and each vector in the codebook can be represented as a $D$ -bit binary string.
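The correspondence between a cube vertex and a $D$-bit string can be sketched in a few lines of Python. This is an illustration only: the function names `encode_1bit` and `decode_1bit` are hypothetical, the random rotation used by RaBitQ is left out, and the coordinates are taken to have magnitude $\frac{1}{\sqrt{D}}$, which is what makes every vertex a unit vector.

```python
import math

def encode_1bit(x):
    """Map a vector to a D-bit string: '1' for a non-negative coordinate, '0' otherwise."""
    return "".join("1" if c >= 0 else "0" for c in x)

def decode_1bit(bits):
    """Map a D-bit string back to the cube vertex with coordinates +-1/sqrt(D)."""
    scale = 1.0 / math.sqrt(len(bits))
    return [scale if b == "1" else -scale for b in bits]

x = [0.6, -0.3, 0.1, -0.73]          # arbitrary example vector (D = 4)
code = encode_1bit(x)                 # D-bit string
vertex = decode_1bit(code)            # corresponding codebook vector
norm = math.sqrt(sum(c * c for c in vertex))
print(code, vertex, norm)
```

For $D = 4$ each coordinate of the vertex is $\pm 0.5$, and the printed norm is 1.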

However, when more bits are used per dimension, we cannot easily find a codebook that is nested in the unit sphere: when each dimension has more than 1 bit, the codebook corresponds to vectors on a grid. In the following figure, we plot the codebook of 2-bit quantization in the 2-dimensional space as an example, where each empty blue point represents a vector in the codebook.

Recall from our previous post that RaBitQ achieves accurate distance estimation because it designs an accurate estimator. This estimator depends on the property that the codebook consists of unit vectors. However, in the current example, the property does not hold.

Finding the Nearest Vector on a Normalized Codebook

To resolve this issue, a natural idea is to map the codebook onto the unit sphere by normalization, which is illustrated as follows. The green circle visualizes the 2-dimensional unit sphere and the red points represent the normalized vectors in the codebook.

Performing this operation resolves the aforementioned issue of estimators. However, this causes extra challenges: for a data vector, we cannot easily find its nearest vector in the codebook.

When the codebook is formed of vectors on a grid (the empty blue points), we can easily perform rounding (i.e., scalar quantization) on each dimension to find the nearest vector.
When the codebook is normalized (the filled red points), the result of rounding is not necessarily the vector that best approximates the data vector.

In the following figure, we provide such an example. The purple triangle represents the data vector X, which we want to quantize. Here in the codebook, vector A is the one that is the nearest to X in the codebook, which means that it is also the result if we perform rounding (scalar quantization) on X. However, when the codebook is normalized, the vector B is the one that best approximates the vector X: the corresponding red point of vector B is closer to the vector X.

So the question is how we can find the vector that best approximates a data vector after the normalization of the codebook. We find that although the optimal vector might not be found by directly performing rounding on a data vector, if we rescale the vector and perform rounding, the optimal vector can be found. Let us continue with the example. In the figure below, we plot a vector tX with another purple triangle. Here the rescaling factor t is a positive number.

We can see from the figure that, when the data vector X is rescaled to the location of tX, its nearest vector (which can be found by rounding) is B: by rescaling the data vector and then performing rounding, we successfully find the vector in the codebook that best approximates it.

But, is it a special case? Is it true for any possible vector? Through rigorous mathematical proofs (see Lemma 3.1 in our paper), we have discovered that for any data vector, there is always a rescaling factor such that, by performing rounding (scalar quantization) on the rescaled vector, we can find the optimal vector in the codebook.

Based on this lemma, to find the optimal vector, we can try different rescaling factors. For every rescaling factor, we compute the nearest vector by rounding and compute the similarity/error produced in this case. Finally, by comparing the errors, we find the optimal rescaling factor and the vector that best approximates our data vector. We refer readers to Section 3.2 of our paper for the detailed strategy of choosing which factors to try. It is worth highlighting that this algorithm is guaranteed to find exactly the optimal vector in the codebook.
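A minimal Python sketch of this search is given below. It is an illustration under stated assumptions rather than the implementation from the paper: the function names are hypothetical, the grid levels are assumed to be $\pm 0.5, \pm 1.5, \dots$, and the similarity used is the cosine between the data vector and the (unnormalized) grid vector, which is the same as the inner product with the normalized codebook vector for a unit data vector. Instead of the strategy of Section 3.2, it simply tries one rescaling factor from every interval on which the rounding result is constant, which is practical only for short vectors.

```python
import itertools
import math

def round_to_grid(v, bits):
    """Scalar quantization: snap each coordinate to the nearest level in
    {-(2^bits - 1)/2, ..., -0.5, +0.5, ..., +(2^bits - 1)/2}."""
    top = (2 ** bits - 1) / 2
    return [min(max(math.floor(c) + 0.5, -top), top) for c in v]

def cosine(x, c):
    dot = sum(a * b for a, b in zip(x, c))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in c))

def quantize_with_rescaling(x, bits):
    """Try one rescaling factor t per interval where round_to_grid(t * x) is
    constant, and keep the grid vector with the highest cosine similarity."""
    half = 2 ** (bits - 1)
    # The rounding of t * x changes only when some t * |x_i| crosses 1, ..., half - 1.
    crit = sorted({k / abs(c) for c in x if c != 0 for k in range(1, half)})
    if crit:
        mids = [(a + b) / 2 for a, b in zip(crit, crit[1:])]
        factors = [crit[0] / 2] + mids + [crit[-1] * 2]
    else:
        factors = [1.0]  # e.g. bits == 1: rounding never changes with t
    candidates = [round_to_grid([t * c for c in x], bits) for t in factors]
    return max(candidates, key=lambda c: cosine(x, c))

x = [0.9, 0.35, -0.2, 0.05]               # arbitrary example vector
plain = round_to_grid(x, 2)               # rounding without rescaling
best = quantize_with_rescaling(x, 2)      # rounding after the best rescaling

# Brute-force check over all 4^4 grid vectors (feasible only in tiny dimensions).
levels = [-1.5, -0.5, 0.5, 1.5]
brute = max(cosine(x, c) for c in itertools.product(levels, repeat=len(x)))
print(cosine(x, plain), cosine(x, best), brute)
```

In this small example the rescaled search matches the brute-force optimum, while plain rounding gives a lower similarity.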

Another Equivalent Way of Understanding

Besides rescaling the data vector, we note that the algorithm can be equivalently understood from another perspective: rescaling the codebook. Let us still consider quantizing the vector X.

In the figure above, both the data vector and the codebook are not rescaled. In this case, the nearest vector in the codebook of X is A and it produces a relatively large error.

When we rescale the codebook to the location indicated by the figure above, the nearest vector of X in the codebook becomes B and the error (the distance between X and the rescaled B) is much smaller. Thus, our algorithm can also be equivalently explained in the following way: it performs scalar quantization by trying different parameters on a per-vector basis, computes the quantization error produced under each parameter, and selects the parameter and codebook vector that minimize the error. Mathematically, rescaling a data vector by a factor of t is equivalent to rescaling a codebook by a factor of 1/t. In addition, it is also worth noting that the original RaBitQ, which quantizes every dimension to 1 bit, can be regarded as a special case of the extended technique.
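The equivalence of the two views can be written out in one line. Here $\mathcal{C}$ denotes the unnormalized grid codebook and $c$ one of its vectors; this notation is introduced only for this illustration. For any rescaling factor $t > 0$, pulling $t$ out of the norm does not change which $c$ attains the minimum:

```latex
\arg\min_{c \in \mathcal{C}} \lVert t x - c \rVert
  = \arg\min_{c \in \mathcal{C}} \; t \, \Bigl\lVert x - \tfrac{1}{t} c \Bigr\rVert
  = \arg\min_{c \in \mathcal{C}} \Bigl\lVert x - \tfrac{1}{t} c \Bigr\rVert ,
  \qquad t > 0 .
```

The left-hand side is rounding the rescaled data vector $t x$ on the original grid; the right-hand side is rounding the original data vector $x$ on the grid shrunk by $\frac{1}{t}$.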

Summary

Based on the technique above, we resolve the key question: how to find the best approximation of a vector when multiple bits are available in every dimension. Intuitively, our strategy is to perform rounding (i.e., scalar quantization) by trying different parameters on a per-vector basis, compute the quantization error produced under each parameter, and select the optimal parameter and vector. Based on this strategy and the estimator discussed in our previous post, our method achieves the following results.

Theoretically, we prove that the extended version of RaBitQ provides an asymptotically optimal error bound in terms of the trade-off between space and accuracy. To our knowledge, RaBitQ is also the first practical algorithm that achieves this optimality.
Empirically, on all the tested datasets, when quantizing a vector to 2 bits per dimension, RaBitQ is more accurate than the state-of-the-art variant of scalar quantization by orders of magnitude. From 3 bits and above, RaBitQ still brings significant and consistent improvement.
In addition, we note that the error bound of RaBitQ not only promises stable accuracy across different datasets, but also opens a broader range of opportunities for optimizing vector search. For example, based on the error bound, we can check whether a candidate is unlikely to become the nearest neighbor: if the lower bound of its approximate distance is larger than the distance of the current nearest neighbor, it is unlikely to be the answer. We refer readers to our paper and code repos for more experimental results and techniques.

Since its release earlier this year, RaBitQ has been adopted in several real-world systems. For example, TensorChord adopts RaBitQ in their cost-effective vector search systems and has published a detailed blog explaining how to optimize this algorithm step by step in Rust. Additionally, Elastic integrated our RaBitQ algorithm into a feature, which they call "BBQ". Their empirical evaluation highlights the breakthrough performance made by our RaBitQ algorithm. In September 2024, we released an updated version of RaBitQ with an extension that enhances scalar quantization. We look forward to seeing this extension further drive performance improvements in production systems.

Quantization in The Counterintuitive High-Dimensional Space

Jianyang Gao — Sun, 08 Dec 2024 12:39:18 +0000

The cover is generated by DALL-E.

In recent years, my research has focused on algorithms for approximate nearest neighbor (ANN) search in high-dimensional spaces, which is also known as vector search. This area has become increasingly significant due to its applications in fields like vector databases, large language models (LLMs), and retrieval-augmented generation (RAG). High-dimensional spaces, however, are full of counterintuitive phenomena that present unique challenges for algorithm design. Through my research journey, I have learned that these phenomena can either pose obstacles or offer valuable opportunities, depending on how well they are understood and applied.

In this post, I will share a famous counterintuitive phenomenon in high-dimensional spaces, explain its underlying principles, and demonstrate how it can be utilized to improve the accuracy of quantization algorithms. This discussion aims to provide intuitive insights and complementary explanations to our research on RaBitQ and its extensions, which introduce optimized approaches to binary and scalar quantization, respectively. The code for RaBitQ and its extensions has been publicly available since earlier this year (2024).

A Counterintuitive Phenomenon

Designing algorithms for high-dimensional spaces presents unique challenges. One major bottleneck is that the intuition based on our everyday experience in three-dimensional space often fails in higher dimensions. High-dimensional probability theory offers valuable insights into the structure of these spaces, but the heavy reliance on complex mathematics can sometimes make it difficult to grasp intuitively. In this post, rather than diving straight into the math, let us explore some simple examples to build an intuitive understanding of the counterintuitive facts in high-dimensional spaces.

Let us begin by considering an arbitrary unit vector (a vector with a length of 1), denoted as $x$ . Suppose the following scenario: we are interested in the value of its projection onto a certain vector—let's take the first coordinate vector for simplicity—represented as $x [0]$ . However, for some reason, we do not have direct access to this value and instead seek a rough estimate of its range without performing any computation. Since the vector $x$ has a length of 1, and without any additional information, the only conclusion we can draw is that $x [0]$ must lie within the range $[- 1, 1]$ .

Next, consider a random unit vector uniformly distributed on the unit sphere, meaning it has equal probability of appearing at any point on the sphere's surface. To gain insight, let us generate some random vectors following this distribution and examine the value of $x [0]$ . The empirical distribution of $x [0]$ is plotted as follows.

On the left side, we present the case where the vector has 3 dimensions. This scenario appears intuitive: for a unit vector (i.e., with length of 1), $x [0]$ can take any value within the range $[- 1, 1]$ .
On the right side, however, when the vector has 1000 dimensions, the situation becomes quite unusual. While the possible range of $x [0]$ remains $[- 1, 1]$ , the figure reveals a striking phenomenon: the actual range of $x [0]$ is narrowly concentrated around 0.

This unexpected behavior underscores a fundamental difference between low-dimensional and high-dimensional spaces: in high-dimensional spaces, randomness can give rise to surprising certainty.
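The experiment behind the figure is easy to reproduce. The sketch below uses only the Python standard library; the function names and the sample count are choices made for this illustration. It draws random unit vectors by normalizing Gaussian vectors (a standard way to sample uniformly from the sphere) and reports how far $x [0]$ strays from 0, alongside $\frac{1}{\sqrt{D}}$ for comparison.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform sample from the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def first_coordinate_stats(dim, samples=500, seed=0):
    """Return (largest |x[0]|, root-mean-square of x[0]) over random unit vectors."""
    rng = random.Random(seed)
    first = [random_unit_vector(dim, rng)[0] for _ in range(samples)]
    rms = math.sqrt(sum(v * v for v in first) / samples)
    return max(abs(v) for v in first), rms

for dim in (3, 1000):
    largest, rms = first_coordinate_stats(dim)
    print(f"D={dim}: max |x[0]| = {largest:.3f}, RMS = {rms:.3f}, "
          f"1/sqrt(D) = {1 / math.sqrt(dim):.3f}")
```

With $D = 3$ the samples of $x [0]$ spread over almost the whole interval $[- 1, 1]$, while with $D = 1000$ even the largest sample stays close to 0 and the root-mean-square value is close to $\frac{1}{\sqrt{D}}$.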

But Why?

This phenomenon is explained by the Concentration of Measure, a fundamental principle in high-dimensional geometry and probability. To better understand this behavior, let’s continue exploring the example and develop some intuitive explanations for the phenomenon.

Recall that $x$ is a unit vector, meaning its length is 1. The length is computed using the formula:

$$\lVert x \rVert^{2} = \sum_{i=1}^{D} x[i]^{2}$$

Here $D$ is the dimensionality of the vector.

When $D = 3$ :
The squared length of the vector equals the sum of three non-negative terms. On average, each term contributes $\frac{1}{3}$ to the total. It is not surprising if one term contributes a slightly larger proportion (e.g., more than 50%) of the total length, while the remaining two terms contribute less.
When $D = 1000$ :
The squared length now equals the sum of 1000 terms, with each term contributing, on average, $\frac{1}{1000}$ . If a single term were to contribute a disproportionately large fraction of the length (e.g., over 50%), this would require many other terms to contribute far less than their average share, corresponding to a scenario which has an exceedingly low probability.

Thus, in high-dimensional spaces, the value $x [0]$ is unlikely to deviate from 0 significantly. Formally, based on the seminal Johnson-Lindenstrauss (JL) Lemma, with high probability the value of a coordinate deviates from 0 by no more than $\frac{2}{\sqrt{D}}$ . For a more detailed and rigorous explanation, please refer to the JL Lemma.

Concentration? So What?

The concentration phenomenon in high-dimensional spaces leads to some intriguing implications. Let’s revisit the example:

When $x$ has 3 dimensions, the only information we have about $x [0]$ is that it lies within $[- 1, + 1]$ , which is not very informative.
When $x$ has 1000 dimensions, the concentration phenomenon tells us that $x [0]$ lies within $(- \frac{2}{\sqrt{D}}, \frac{2}{\sqrt{D}})$ with high probability. Since $D$ is large, this represents a much tighter bound on $x [0]$ in high-dimensional spaces.

This is quite surprising, because we did not access a single bit of data or perform any computation to reach this conclusion. Yet, the uncertainty about $x [0]$ significantly decreases: its range shrinks from $[- 1, 1]$ to $(- \frac{2}{\sqrt{D}}, \frac{2}{\sqrt{D}})$ . In other words, we gain valuable information without doing any computation!

This insight opens up opportunities for improvement in algorithms by leveraging this "free information". One area where this can be particularly beneficial is quantization. By utilizing this phenomenon effectively, we can potentially achieve improvements without additional costs.

RaBitQ

Quantization is a family of methods developed for vector compression. In vector search, the compressed vectors are used for estimating distances or inner products. RaBitQ and its extensions can be regarded as optimized approaches to binary and scalar quantization, respectively.

In particular, RaBitQ achieves promising performance by effectively leveraging the "free information" discussed earlier. In this post, we will focus on explaining how this "free information" is utilized, leaving the implementation details aside. For those interested in a deeper dive into the technical aspects, we highly recommend referring to our papers and code repos.

The workflow of quantization in general includes two phases:

Index Phase: In this phase, a quantization codebook is constructed. Each vector in the database is then assigned to its nearest vector in the codebook, which serves as its quantized representation.
Query Phase: In this phase, the quantized vectors are used to approximate metrics such as Euclidean distances, inner products, or cosine similarity.

Let us use Euclidean distance as an example in the following discussion, while other metrics can be easily supported with similar derivation.

Index Phase

Step 1: Normalization

Let $o_{r}$ and $q_{r}$ be a data vector and a query vector, respectively. Our target is to estimate their Euclidean distance. Let $c$ be a centroid of the data vectors. To simplify distance estimation, we first reduce it to the estimation of the inner product between unit vectors. We normalize the vectors with respect to the centroid, i.e., we take $o := \frac{o_{r} - c}{\lVert o_{r} - c \rVert}$ and $q := \frac{q_{r} - c}{\lVert q_{r} - c \rVert}$ .

Then using the following equation, we can reduce the question of distance estimation to the one of inner product estimation over unit vectors.

$$\lVert o_{r} - q_{r} \rVert^{2} = \lVert (o_{r} - c) - (q_{r} - c) \rVert^{2} = \lVert o_{r} - c \rVert^{2} + \lVert q_{r} - c \rVert^{2} - 2 \cdot \lVert o_{r} - c \rVert \cdot \lVert q_{r} - c \rVert \cdot \langle q, o \rangle$$

Here $\lVert o_{r} - c \rVert$ can be precomputed in the index phase and $\lVert q_{r} - c \rVert$ can be computed once per query and shared by all data vectors. Thus, the question is reduced to that of estimating $\langle q, o \rangle$ . Intuitively, normalization places a cluster of vectors at the center of the space and then projects each of them onto the unit sphere. Thus, this operation spreads the data vectors evenly on the unit sphere.
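The reduction can be checked numerically. The sketch below uses only the Python standard library; the vectors and the stand-in centroid are random values made up for the illustration, and the helper names are hypothetical. It computes the squared distance directly and through the inner product of the normalized vectors.

```python
import math
import random

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    n = norm(a)
    return [x / n for x in a]

rng = random.Random(0)
dim = 8
o_r = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # raw data vector
q_r = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # raw query vector
c = [rng.gauss(0.0, 0.1) for _ in range(dim)]    # stand-in centroid (made-up values)

o = unit(sub(o_r, c))        # normalized data vector
q = unit(sub(q_r, c))        # normalized query vector
norm_o = norm(sub(o_r, c))   # can be precomputed at index time
norm_q = norm(sub(q_r, c))   # computed once per query

direct = norm(sub(o_r, q_r)) ** 2
via_inner_product = norm_o ** 2 + norm_q ** 2 - 2 * norm_o * norm_q * dot(q, o)
print(direct, via_inner_product)
```

The two printed values agree up to floating-point rounding, so the only quantity that still has to be estimated from the quantized data is the inner product of the two unit vectors.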

Step 2: Codebook Construction

Given that the raw data vectors have been converted into unit vectors that spread evenly on the unit sphere, intuitively, we should also construct a codebook that spreads evenly on the unit sphere. Here we take a natural construction, a hypercube nested in the unit sphere, which is illustrated as follows.

Recall that we aim to utilize the "free information" that comes from randomness. Here we inject some randomness into the codebook by randomly rotating it.

Then for the data vector $o$ , we find its nearest vector $\bar{o}$ in the codebook. Because $\bar{o}$ is a vertex of the rotated hypercube, it can be stored with 1 bit per dimension.
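The index phase can be sketched as follows. This is a toy illustration in pure Python, not our released implementation: the helper names are hypothetical, and the random rotation is built by Gram-Schmidt orthogonalization of Gaussian vectors, which is affordable only in small dimensions. The key observation is that the nearest rotated vertex is obtained by taking the sign of the data vector's coordinate along each rotated axis; the brute-force loop over all $2^{D}$ vertices is included only to confirm this for a small $D$.

```python
import itertools
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_rotation(dim, rng):
    """Random orthogonal matrix, returned as a list of orthonormal rows
    (Gram-Schmidt on Gaussian vectors)."""
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            p = dot(v, r)
            v = [a - p * b for a, b in zip(v, r)]
        n = math.sqrt(dot(v, v))
        rows.append([a / n for a in v])
    return rows

def codeword(bits, rotation):
    """Rotated cube vertex selected by the bits: sum_i (+-1/sqrt(D)) * row_i."""
    dim = len(bits)
    scale = 1.0 / math.sqrt(dim)
    out = [0.0] * dim
    for bit, row in zip(bits, rotation):
        s = scale if bit else -scale
        out = [a + s * b for a, b in zip(out, row)]
    return out

def quantize(o, rotation):
    """One bit per dimension: the sign of o along each rotated axis."""
    return [1 if dot(row, o) >= 0 else 0 for row in rotation]

rng = random.Random(0)
dim = 8
rotation = random_rotation(dim, rng)
g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
g_norm = math.sqrt(dot(g, g))
o = [a / g_norm for a in g]             # a unit data vector

bits = quantize(o, rotation)
o_bar = codeword(bits, rotation)        # quantized vector, also of unit length

# All codewords have unit length, so "nearest" means "largest inner product".
best_brute = max(dot(o, codeword(b, rotation))
                 for b in itertools.product((0, 1), repeat=dim))
print(bits, dot(o, o_bar), best_brute)
```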

Query Phase: Distance Estimation

Now it's time to construct an estimator for the inner product $\langle q, o \rangle$ based on the quantized vector. To achieve this, let us first analyze the geometric relationship among the vectors $o$ , $q$ and $\bar{o}$ .

In particular, we find that although $o$ , $q$ and $\bar{o}$ are vectors in a high-dimensional space, to estimate $\langle q, o \rangle$ we only need to focus on the 2-dimensional plane which hosts $o$ and $q$ , which is illustrated as follows.

Here $e_{1}$ is a unit vector which is orthogonal to $o$ and is in the plane. By analyzing the geometric relationship of the vectors on the plane, we derive the following equation regarding the inner product among the vectors.

$$\langle \bar{o}, q \rangle = \langle \bar{o}, o \rangle \cdot \langle o, q \rangle + \langle \bar{o}, e_{1} \rangle \cdot \sqrt{1 - \langle o, q \rangle^{2}}$$

This is to say, if we know the values of the variables other than $\langle o, q \rangle$ in the equation, we can exactly compute the value of $\langle o, q \rangle$ by solving the equation.

Here $\langle \bar{o}, o \rangle$ can be precomputed during indexing because it is independent of the query. $\langle \bar{o}, q \rangle$ can be computed efficiently during querying with some implementation tricks, for which we refer readers to our original papers and code repos. However, the value of $\langle \bar{o}, e_{1} \rangle$ is intrinsically hard to compute: it depends on both the data vector and the query vector, so it can neither be computed during indexing nor be computed efficiently during querying without accessing the original data vector.

Recall that during indexing, we inject some randomness into the codebook, and randomness may bring "free information" in high-dimensional spaces. Would it be the case here? We sample many different random rotation matrices and investigate the empirical distribution of the variable $\langle \bar{o}, e_{1} \rangle$ .

In the figure above, each red point represents the projection of a sample of $\bar{o}$ onto the plane. The vertical coordinate of a red point is the value $\langle \bar{o}, e_{1} \rangle$ . Note that in this case, $\bar{o}$ is just a unit vector, so the possible region of its projection on the plane is the whole green disk. However, this empirical study shows that the actual region of its projection is around the red point cloud, which is much smaller. Concentration happens.

This result demonstrates that the variable $\langle \bar{o}, e_{1} \rangle$ is highly concentrated around 0. As a result, treating it as 0 in the computation of our target will produce promising accuracy even though we perform no computation on $\langle \bar{o}, e_{1} \rangle$ .

Based on this result, we derive the following estimator for our target.

$$\langle o, q \rangle \approx \frac{\langle \bar{o}, q \rangle}{\langle \bar{o}, o \rangle}$$

By rigorously analyzing the extent of concentration, we prove that the error of the estimation is only $O (\frac{1}{\sqrt{D}})$ , which achieves asymptotic optimality in theory. Recalling the "free information" discussion above, this error bound matches the scale of the concentration described at the beginning of this post, which indicates that the algorithm fully utilizes the "free information" gained through the concentration phenomenon in high-dimensional spaces. This is the core reason that RaBitQ achieves such a large gain in accuracy.
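The behavior of the estimator can be explored with a small simulation. The sketch below is an illustration under an explicit assumption: instead of rotating the codebook, it samples the data vector uniformly from the unit sphere and uses the unrotated cube, which has the same distribution as rotating the codebook by a uniformly random rotation. The function names, the target inner product of 0.5, and the trial count are arbitrary choices for this example. It prints the mean absolute error and that error multiplied by $\sqrt{D}$ , which should stay roughly constant if the error shrinks like $\frac{1}{\sqrt{D}}$ .

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def sample_pair(dim, cos_theta, rng):
    """Unit vector o uniform on the sphere and unit vector q with <o, q> = cos_theta."""
    o = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    p = dot(g, o)
    e1 = unit([a - p * b for a, b in zip(g, o)])   # random unit vector orthogonal to o
    sin_theta = math.sqrt(1.0 - cos_theta ** 2)
    q = [cos_theta * a + sin_theta * b for a, b in zip(o, e1)]
    return o, q

def estimate(o, q):
    """<o_bar, q> / <o_bar, o>, with o_bar the nearest vertex of the unit cube."""
    scale = 1.0 / math.sqrt(len(o))
    o_bar = [scale if a >= 0 else -scale for a in o]
    return dot(o_bar, q) / dot(o_bar, o)

def mean_abs_error(dim, cos_theta=0.5, trials=200, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        o, q = sample_pair(dim, cos_theta, rng)
        total += abs(estimate(o, q) - cos_theta)
    return total / trials

for dim in (64, 256, 1024):
    err = mean_abs_error(dim)
    print(f"D={dim}: mean |error| = {err:.4f}, error * sqrt(D) = {err * math.sqrt(dim):.3f}")
```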

More About RaBitQ

So far, we’ve explored the key insights behind RaBitQ. However, this is just the beginning. The RaBitQ papers and the accompanying code repositories include more innovations in implementation, optimization and applications. Note that the original RaBitQ only supports binary quantization. Our extended version further supports quantization with 2 bits and beyond, which can be regarded as a more optimized approach to scalar quantization. Besides leveraging the "free information", this extension introduces a new strategy which, intuitively, performs scalar quantization by trying different parameters on a per-vector basis, computes the quantization error produced under each parameter, and selects the parameter and codebook vector that minimize the error. It turns out that (1) under the same compression rates (e.g., 2-bit, 3-bit and 4-bit), it achieves significantly better accuracy and (2) its distance computation can be carried out with exactly the same implementation as scalar quantization, so it can replace scalar quantization seamlessly.

RaBitQ has been adopted in several real-world systems, demonstrating its broad applicability and impact. For example, TensorChord has developed a highly cost-effective vector search solution in which RaBitQ is one of the components. They also provide an in-depth blog illustrating how to optimize this algorithm step by step in Rust. Additionally, Elastic incorporated our RaBitQ algorithm into their BBQ feature. Their evaluation further validates the breakthrough performance of RaBitQ compared to classical Product Quantization. Three months ago (September 2024), we released an extension of RaBitQ that enhances scalar quantization, and we hope to see its adoption continue to drive performance gains in production systems.