<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonardo Kuffo</title>
    <description>The latest articles on DEV Community by Leonardo Kuffo (@leokuffo).</description>
    <link>https://dev.to/leokuffo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1720364%2F04b14091-62cd-4b56-aee8-d9de10116bc9.jpg</url>
      <title>DEV Community: Leonardo Kuffo</title>
      <link>https://dev.to/leokuffo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leokuffo"/>
    <language>en</language>
    <item>
      <title>Sub-millisecond similarity search on IVF indexes with PDX</title>
      <dc:creator>Leonardo Kuffo</dc:creator>
      <pubDate>Fri, 25 Jul 2025 14:21:35 +0000</pubDate>
      <link>https://dev.to/leokuffo/sub-millisecond-similarity-search-on-ivf-indexes-with-pdx-35n8</link>
      <guid>https://dev.to/leokuffo/sub-millisecond-similarity-search-on-ivf-indexes-with-pdx-35n8</guid>
      <description>&lt;p&gt;In a &lt;a href="https://www.lkuffo.com/vertical-vector-similarity-search-pdx/" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;, I talked about PDX. &lt;a href="https://github.com/cwida/PDX" rel="noopener noreferrer"&gt;PDX&lt;/a&gt; is a data layout that &lt;strong&gt;transposes&lt;/strong&gt; vectors in a column-major order. This layout unleashes the true potential of dimension pruning algorithms for similarity search.&lt;/p&gt;

&lt;p&gt;In this blog post, I discuss how we recently improved this first version of the PDX layout and &lt;strong&gt;achieved sub-millisecond similarity search on millions of vectors using only vanilla IVF indexes on the PDX layout.&lt;/strong&gt; This is remarkable, as vanilla IVF indexes are deemed "slow" by many vector database vendors.&lt;/p&gt;

&lt;p&gt;Of course, you can skip the intro and go directly to the benchmarks ;)&lt;/p&gt;

&lt;h2&gt;Recap: The PDX layout&lt;/h2&gt;

&lt;p&gt;PDX is a transposed layout (a.k.a. columnar, vertical, or decomposed layout), which means that the same dimension of different vectors is stored sequentially. This decomposition occurs within a block (e.g., a cluster in an IVF index).&lt;/p&gt;

&lt;p&gt;The PDX layout unleashes the true potential of pruning algorithms (e.g., &lt;a href="https://github.com/gaoj0017/ADSampling/" rel="noopener noreferrer"&gt;ADSampling&lt;/a&gt;). The idea of pruning is to avoid checking all the dimensions of a vector to determine if it is a neighbor of a query. PDX is the ideal environment for pruning, as &lt;strong&gt;we avoid introducing a control hazard in the core distance calculation kernel&lt;/strong&gt; to ask whether we can prune a vector. You can read more details about this and other benefits of the PDX layout in &lt;a href="https://www.lkuffo.com/vertical-vector-similarity-search-pdx/" rel="noopener noreferrer"&gt;my previous blog post&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;However, when increasing the number of retrieved neighbors (e.g., K = 100), the pruning threshold loosens. As a result, a higher number of candidates is explored in the pruning phase of the search algorithm, resulting in increased random access in the transposed layout. The same problem occurs when few clusters are explored (e.g., 8 clusters), as the number of candidates in the pruning phase is high. &lt;/p&gt;

&lt;h2&gt;An improved PDX layout&lt;/h2&gt;

&lt;p&gt;We have evolved the PDX layout from the one presented in &lt;a href="https://arxiv.org/pdf/2503.04422" rel="noopener noreferrer"&gt;our publication&lt;/a&gt;. This new iteration reduces random access and is less affected by looser pruning thresholds. It is thereby more robust and accelerates a wider variety of settings (e.g., K = 100, fewer clusters probed). Furthermore, we adapted the layout to work with smaller data types such as 8-bit and 1-bit vectors.&lt;/p&gt;

&lt;p&gt;The key observation for these improvements is that pruning algorithms, such as ADSampling, access dimensions &lt;strong&gt;in a sequential order&lt;/strong&gt;. With this in mind, let’s look at the newer versions of PDX.&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;float32&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;For &lt;code&gt;float32&lt;/code&gt;, the first 25% of the dimensions are fully decomposed, just as in the original PDX. We refer to this as the "vertical block." The remaining 75% are decomposed into subvectors of 64 dimensions; we refer to this as the "horizontal block." The vertical block is used for efficient pruning, and the horizontal block is accessed only for the candidates that were not pruned. Since the horizontal block is still decomposed every 64 dimensions, we keep a chance to prune the few remaining candidates at every 64-dimension boundary.&lt;/p&gt;
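
&lt;p&gt;To make the addressing concrete, here is a small sketch of where dimension &lt;code&gt;dim&lt;/code&gt; of vector &lt;code&gt;vec&lt;/code&gt; lands inside a block of &lt;code&gt;n&lt;/code&gt; vectors with &lt;code&gt;d&lt;/code&gt; dimensions. The function name and the exact boundary handling are mine, not taken from the PDX codebase:&lt;/p&gt;

```c
#include <stddef.h>

/* Sketch of the improved float32 PDX layout: the first d/4 dimensions are
 * fully transposed (dimension-major, the "vertical block"); the remaining
 * dimensions are stored in 64-dimension sub-vectors, contiguous per vector
 * within each 64-dimension slice (the "horizontal block").
 * Assumes d is divisible by 4 and (d - d/4) by 64. */
static size_t pdx_f32_offset(size_t vec, size_t dim, size_t n, size_t d) {
    size_t vertical_dims = d / 4;          /* 25% fully decomposed */
    if (dim < vertical_dims)
        return dim * n + vec;              /* dimension-major */
    size_t h_dim  = dim - vertical_dims;   /* position in the horizontal block */
    size_t group  = h_dim / 64;            /* which 64-dimension slice */
    size_t in_sub = h_dim % 64;            /* offset inside the sub-vector */
    return vertical_dims * n               /* skip the vertical block */
         + group * n * 64                  /* skip earlier slices */
         + vec * 64 + in_sub;              /* this vector's sub-vector */
}
```

&lt;p&gt;Note how, at a fixed vertical dimension, consecutive vectors are adjacent (fast pruning scans), while a non-pruned vector's next 64 dimensions are contiguous (fast tail scans).&lt;/p&gt;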

&lt;p&gt;The following image shows this layout. Storage is sequential from left to right, and from top to bottom:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-f32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-f32.png" alt="PDX Layout for 32-bit data" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this layout, random access is significantly reduced during the pruning phase of the search, while still benefiting from the per-query and per-dataset adaptiveness provided by the vertical block. Another trick we incorporated into our algorithm: if the number of remaining candidates is already low before the scan of the vertical block finishes, we jump directly to the horizontal blocks. As a result, random access to individual dimensions happens only for an extremely small number of vectors.&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;8 bits&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Smaller data types are not friendly to the original PDX, as we must accumulate distances in wider types, creating an asymmetry between the data width and the accumulator width. We can work around this by changing the layout. For &lt;code&gt;8 bits&lt;/code&gt;, the vertical block is decomposed every 4 dimensions. This allows us to use dot product instructions (&lt;code&gt;VPDPBUSD&lt;/code&gt; in &lt;a href="https://www.officedaytime.com/simd512e/simdimg/si.php?f=vpdpbusd" rel="noopener noreferrer"&gt;x86&lt;/a&gt; and &lt;code&gt;UDOT/SDOT&lt;/code&gt; in &lt;a href="https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-" rel="noopener noreferrer"&gt;NEON&lt;/a&gt;) to compute L2 or IP kernels while still benefiting from PDX. The horizontal block remains decomposed every 64 dimensions.&lt;/p&gt;
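
&lt;p&gt;To see why grouping the vertical block every 4 dimensions matches the hardware, here is a scalar model of what &lt;code&gt;VPDPBUSD&lt;/code&gt;-style instructions compute in each 32-bit lane (a sketch for illustration; the real kernels use the SIMD intrinsics directly):&lt;/p&gt;

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of VPDPBUSD (x86) or UDOT/SDOT (NEON):
 * four 8-bit products are summed and accumulated into an int32. With the
 * 8-bit vertical block decomposed every 4 dimensions, one lane consumes
 * one 4-dimension group of one vector per instruction. */
static int32_t dot4_u8_i8(int32_t acc, const uint8_t q[4], const int8_t v[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)q[i] * (int32_t)v[i];
    return acc;
}

/* Inner-product kernel over d dimensions (d divisible by 4). */
static int32_t ip_u8_i8(const uint8_t *q, const int8_t *v, int d) {
    int32_t acc = 0;
    for (int i = 0; i < d; i += 4)
        acc = dot4_u8_i8(acc, q + i, v + i);
    return acc;
}
```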

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-u8.png" alt="PDX Layout for 8-bit data" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;1 bit&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;For Hamming/Jaccard kernels, we can use a layout decomposed every 8 dimensions (naturally grouped into bytes). The population count accumulation can be done in &lt;code&gt;bytes&lt;/code&gt;. If d &amp;gt; 256, we flush the popcounts into a wider type every 32 words. This has not been implemented in our repository yet, but you can find some promising benchmarks &lt;a href="https://github.com/lkuffo/binary-index/blob/main/README_LK.md#column-major-layout" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
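
&lt;p&gt;A minimal sketch of such a kernel (the flush interval of 16 bytes is a conservative choice of mine; any interval below 32 bytes keeps the byte accumulator from overflowing, since each byte contributes at most 8):&lt;/p&gt;

```c
#include <stdint.h>

/* Hamming distance for 1-bit vectors packed into bytes (8 dimensions per
 * byte). Population counts are accumulated in a uint8 and periodically
 * flushed into a uint32 before the byte accumulator can overflow. */
static uint32_t hamming_1bit(const uint8_t *a, const uint8_t *b, int nbytes) {
    uint32_t total = 0;
    uint8_t  acc   = 0;
    for (int i = 0; i < nbytes; i++) {
        acc += (uint8_t)__builtin_popcount(a[i] ^ b[i]);
        if ((i & 15) == 15) { total += acc; acc = 0; }  /* flush to u32 */
    }
    return total + acc;
}
```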

&lt;h2&gt;Sub-millisecond similarity search with PDX and IVF₂&lt;/h2&gt;

&lt;p&gt;Finally, we introduce IVF₂, a two-level IVF index that tackles a bottleneck of IVF indexes: finding the nearest centroids. The idea is simple: we cluster the centroids generated by the IVF index. This produces a new, reduced set of super-centroids, which we scan first to determine which clusters of centroids are the most promising. Then, PDX can quickly scan these super-clusters, which contain the original centroids, without sacrificing recall thanks to pruning. In brief, we simply perform a new k-means clustering on the centroids so that pruning algorithms can also be used when scanning them.&lt;/p&gt;
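
&lt;p&gt;The two-level lookup can be sketched as follows. The brute-force scans and all names here are illustrative only; the real implementation scans the super-clusters with PDX and pruning:&lt;/p&gt;

```c
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional vectors. */
static float l2sq(const float *a, const float *b, int d) {
    float s = 0.f;
    for (int i = 0; i < d; i++) { float t = a[i] - b[i]; s += t * t; }
    return s;
}

/* Index of the nearest row to q among m row-major points. */
static int nearest(const float *q, const float *rows, int m, int d) {
    int best = 0;
    float best_d = l2sq(q, rows, d);
    for (int i = 1; i < m; i++) {
        float dist = l2sq(q, rows + (size_t)i * d, d);
        if (dist < best_d) { best_d = dist; best = i; }
    }
    return best;
}

/* IVF2 sketch: first pick the nearest of s super-centroids, then scan only
 * the original centroids clustered under it. members[g] lists the centroid
 * ids in super-cluster g; member_count[g] is its size. */
static int ivf2_nearest_centroid(const float *q,
                                 const float *super, int s,
                                 const float *centroids,
                                 const int **members,
                                 const int *member_count, int d) {
    int g = nearest(q, super, s, d);             /* level 1 */
    int best = members[g][0];
    float best_d = l2sq(q, centroids + (size_t)best * d, d);
    for (int i = 1; i < member_count[g]; i++) {  /* level 2 */
        int c = members[g][i];
        float dist = l2sq(q, centroids + (size_t)c * d, d);
        if (dist < best_d) { best_d = dist; best = c; }
    }
    return best;
}
```

&lt;p&gt;In practice, one would probe several super-clusters rather than just one to keep recall high.&lt;/p&gt;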

&lt;p&gt;This achieves significant throughput improvements when paired with 8-bit quantization. IVF₂ scans the vectors so quickly that the bottleneck of our algorithm is now the random rotation of the query vector needed by ADSampling. We plan to improve this in a future release.&lt;/p&gt;

&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;We present single-threaded &lt;strong&gt;benchmarks&lt;/strong&gt; against FAISS on &lt;code&gt;r7iz.xlarge&lt;/code&gt; (x86_64, Intel Sapphire Rapids with AVX512) and &lt;code&gt;r8g.xlarge&lt;/code&gt; (ARM, Graviton 4 with NEON/SVE) instances. We used 3 datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI embeddings: N=1M, D=1536&lt;/li&gt;
&lt;li&gt;MxbAI embeddings: N=769K, D=1024&lt;/li&gt;
&lt;li&gt;arXiv embeddings: N=2.25M, D=768&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Two-Level IVF (IVF₂)&lt;/h3&gt;

&lt;p&gt;IVF₂ can achieve &lt;strong&gt;sub-millisecond query latency&lt;/strong&gt; on both x86 and ARM, accelerating SIMD-optimized FAISS by factors ranging from 2x to 13x, depending on the target recall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-intel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-intel.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;p&gt;In the OpenAI dataset at 0.90 recall, the random rotation of the query vector dominates the runtime. Therefore, we could also achieve sub-millisecond performance here by speeding up the random rotation of the query with a faster algorithm.&lt;/p&gt;

&lt;p&gt;A key factor for these speedups is that we do not decode the data to the &lt;code&gt;float32&lt;/code&gt; domain. Instead, we transform the query into the &lt;code&gt;8-bit&lt;/code&gt; domain and perform the distance calculation using 8-bit SIMD.&lt;/p&gt;
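
&lt;p&gt;A sketch of what such a query transformation can look like, using a simple min/max scalar quantizer (the scheme and names are illustrative, not the exact PDX transformation):&lt;/p&gt;

```c
#include <stdint.h>

/* Map a float32 query into the same [0, 255] domain as the quantized data,
 * given the [lo, hi] range used to quantize the dataset. Afterwards,
 * distances can be computed entirely with 8-bit SIMD, with no float32
 * decoding of the stored vectors. */
static void quantize_query_u8(const float *q, uint8_t *out, int d,
                              float lo, float hi) {
    float scale = 255.0f / (hi - lo);
    for (int i = 0; i < d; i++) {
        float x = (q[i] - lo) * scale;
        if (x < 0.0f)   x = 0.0f;      /* clamp out-of-range values */
        if (x > 255.0f) x = 255.0f;
        out[i] = (uint8_t)(x + 0.5f);  /* round to nearest */
    }
}
```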

&lt;p&gt;On ARM, things get uglier for FAISS, as it &lt;a href="https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L190" rel="noopener noreferrer"&gt;lacks SIMD&lt;/a&gt; for &lt;code&gt;8-bit&lt;/code&gt; decoding into &lt;code&gt;float32&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-arm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-arm.png" alt="Graviton 4 (ARM with NEON)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Graviton 4 (ARM with NEON)
  &lt;/p&gt;

&lt;h3&gt;Vanilla IVF&lt;/h3&gt;

&lt;p&gt;In vanilla IVF indexes with &lt;code&gt;float32&lt;/code&gt; vectors, PDX demonstrates its superiority, now even with a larger number of neighbors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;p&gt;PDX also maintains its advantage with a smaller number of neighbors (K = 10):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel-k10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel-k10.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;h3&gt;Exhaustive search + IVF&lt;/h3&gt;

&lt;p&gt;An exhaustive search scans all the vectors in the collection. Having an IVF index with PDX can accelerate this &lt;strong&gt;dramatically&lt;/strong&gt; without sacrificing recall, thanks to the reliable pruning of ADSampling. &lt;strong&gt;We can accelerate a complete scan of the collection by up to 50x!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-exhaustive-intel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-exhaustive-intel.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="283"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;p&gt;The key observation here is that, thanks to the underlying IVF index, the exhaustive search starts with the most promising clusters. A tight threshold is found early on, which enables the quick pruning of most candidates.&lt;/p&gt;

&lt;p&gt;Note that, on these datasets, the exhaustive search with SQ8 (8-bit quantization) achieves &amp;gt; 0.9999 recall.&lt;/p&gt;

&lt;h2&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;We think PDX's performance is remarkable given that, at its core, it is a vanilla IVF index with traditional scalar quantization. This means that &lt;strong&gt;we can achieve sub-millisecond similarity search alongside all the benefits of IVF indexes&lt;/strong&gt;: a low memory footprint and low construction times.&lt;/p&gt;

&lt;p&gt;For now, improving the random rotation algorithm is key to further reducing query latency. The &lt;a href="https://vectordb-ntu.github.io/RaBitQ-Library/" rel="noopener noreferrer"&gt;RaBitQ-Library&lt;/a&gt; already proposes an alternative with &lt;code&gt;O(n log n)&lt;/code&gt; complexity, in contrast to the vanilla rotation, which is &lt;code&gt;O(n^2)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will soon add PDX to the &lt;a href="https://vector-index-bench.github.io/" rel="noopener noreferrer"&gt;VIBE benchmark&lt;/a&gt;. So, stay tuned!&lt;/p&gt;

&lt;p&gt;PDX is an open-source project. You can try it yourself in our &lt;a href="https://github.com/cwida/PDX" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>similaritysearch</category>
      <category>vectordatabase</category>
      <category>programming</category>
      <category>rag</category>
    </item>
    <item>
      <title>AWS Graviton 3 &gt; Graviton 4 for Vector Similarity Search</title>
      <dc:creator>Leonardo Kuffo</dc:creator>
      <pubDate>Sun, 30 Mar 2025 17:48:35 +0000</pubDate>
      <link>https://dev.to/leokuffo/aws-graviton-3-graviton-4-for-vector-similarity-search-23b0</link>
      <guid>https://dev.to/leokuffo/aws-graviton-3-graviton-4-for-vector-similarity-search-23b0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you are doing vector search with a vector library that supports SVE, you should use a Graviton 3 machine. It is cheaper, and it will also deliver more raw performance.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few months ago, we started working on a vertical layout for vector similarity search (&lt;a href="https://github.com/cwida/PDX" rel="noopener noreferrer"&gt;PDX&lt;/a&gt;). As part of the benchmarks that we were running on different microarchitectures and vector systems like FAISS, Milvus, and Usearch, there was an observation that puzzled us: &lt;strong&gt;Graviton3 performed better than Graviton4 in almost all vector search scenarios&lt;/strong&gt;, not only in queries per dollar (QP$) but also in queries per second (QPS). This was the case across vector libraries and even in our implementations of vector search algorithms. Here is one example of the QPS and QP$ of both microarchitectures on queries to an &lt;a href="https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes" rel="noopener noreferrer"&gt;IVF index&lt;/a&gt;: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ien7ccax97rn6ajh387.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ien7ccax97rn6ajh387.png" alt="Recall@10: 0.99 on IVF indexes"&gt;&lt;/a&gt;&lt;br&gt;QPS and QP$ on IVF indexes (&lt;code&gt;float32&lt;/code&gt;) using FAISS+SVE. QP$ are in the order of 10^4.
  &lt;/p&gt;

&lt;p&gt;In the OpenAI/1536 dataset with 1M vectors, Graviton3 delivers 25% more queries per dollar than Graviton4! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s be clear:&lt;/strong&gt; Graviton4 is a better machine than Graviton3. It has a higher clock frequency, a 2x bigger L2 cache, a slightly bigger L3, lower memory-access latency, and a much more capable CPU (upgrading from Neoverse V1 to Neoverse V2). This is shown not only by AWS but also by numerous independent benchmarks, such as those by &lt;a href="https://lemire.me/blog/2024/07/10/benchmarking-arm-processors-graviton-4-graviton-3-and-apple-m2/" rel="noopener noreferrer"&gt;Daniel Lemire&lt;/a&gt;, &lt;a href="https://www.phoronix.com/review/aws-graviton4-benchmarks" rel="noopener noreferrer"&gt;Phoronix&lt;/a&gt;, and &lt;a href="https://chipsandcheese.com/p/arms-neoverse-v2-in-awss-graviton-4" rel="noopener noreferrer"&gt;Chips and Cheese&lt;/a&gt;. But then, &lt;em&gt;why would Graviton3 be better than Graviton4 on vector similarity search?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The main culprit&lt;/strong&gt; is that Graviton4 has a SVE SIMD register size of 128 bits &lt;strong&gt;&lt;em&gt;—half of the 256-bit registers of Graviton3&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the rest of this blog post, we will dive deep into &lt;em&gt;why&lt;/em&gt; this difference is particularly detrimental to the performance of vector similarity search and &lt;em&gt;why&lt;/em&gt; it hasn’t been picked up by other benchmarks. But before discussing the Gravitons, let's refresh our knowledge of SIMD.&lt;/p&gt;

&lt;h2&gt;SIMD in Vector Similarity Search&lt;/h2&gt;

&lt;p&gt;Distance calculations in vector similarity search can be optimized using Single-Instruction-Multiple-Data (SIMD) instructions available in the CPU. These special instructions can process &lt;strong&gt;multiple values in parallel with a single CPU instruction&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;x86_64 architectures&lt;/strong&gt;, the main SIMD instruction sets are the Advanced Vector Extensions (AVX). The number of values AVX can process at a time depends on the SIMD register width supported by the CPU. Modern CPU microarchitectures, like Zen4 and Intel Sapphire Rapids, have 512-bit SIMD registers (AVX512), which can process 16 &lt;code&gt;float32&lt;/code&gt; values with one CPU instruction.&lt;/p&gt;

&lt;p&gt;Let’s look at the following C++ code (taken from the &lt;a href="https://github.com/ashvardanian/SimSIMD/blob/6951b006d8c27b89c91b78019c7af3714b7114f5/include/simsimd/spatial.h#L1520" rel="noopener noreferrer"&gt;SimSIMD&lt;/a&gt; codebase) that uses AVX512 SIMD to calculate the Euclidean distance (L2) between two vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;l2sq_f32_avx512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_setzero&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;a_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nl"&gt;l2sq_f32_loop:&lt;/span&gt;
    &lt;span class="n"&gt;a_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;b_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;d_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_sub_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_fmadd_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;l2sq_f32_loop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reduce_f32x16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical part is the &lt;code&gt;l2sq_f32_loop&lt;/code&gt; that loops through the vector dimensions. In each iteration, &lt;code&gt;_mm512_loadu_ps&lt;/code&gt; loads 16 packed single-precision values (32 bits each) into a SIMD register. We do this twice, once for vector &lt;code&gt;a&lt;/code&gt; and once for vector &lt;code&gt;b&lt;/code&gt;. Then, we do the L2 calculation with a subtraction (&lt;code&gt;_mm512_sub_ps&lt;/code&gt;) and a fused multiply-add (&lt;code&gt;_mm512_fmadd_ps&lt;/code&gt;) that accumulates the distances into a result register (&lt;code&gt;d2_vec&lt;/code&gt;). We keep repeating the loop until we have inspected all the vector dimensions (when &lt;code&gt;n == 0&lt;/code&gt;). Finally, we sum all the values in the SIMD register to get the total distance (&lt;code&gt;reduce_f32x16&lt;/code&gt;); we will skip the explanation of this last step.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;strong&gt;ARM architectures&lt;/strong&gt; also provide SIMD instructions through NEON and SVE. NEON was introduced first, supporting SIMD over 128-bit registers (fitting 4 &lt;code&gt;float32&lt;/code&gt; values at a time). SVE was introduced later. Unlike AVX and NEON, SVE supports variable-size SIMD registers in its intrinsics through VLA (Vector Length Agnostic) programming. This alleviates technical debt, as distance kernels no longer need hardware-dependent loop lengths.&lt;/p&gt;

&lt;p&gt;Let’s take a look now at a code with SVE intrinsics (also taken from &lt;a href="https://github.com/ashvardanian/SimSIMD/blob/6951b006d8c27b89c91b78019c7af3714b7114f5/include/simsimd/spatial.h#L799" rel="noopener noreferrer"&gt;SimSIMD&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;l2sq_f32_sve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svdupq_n_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;svbool_t&lt;/span&gt; &lt;span class="n"&gt;pg_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svwhilelt_b32&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;a_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svld1_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svld1_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;a_minus_b_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svsub_f32_x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svmla_f32_x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_minus_b_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_minus_b_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;svcntw&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svaddv_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svptrue_b32&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;svld1_f32&lt;/code&gt; is the SIMD intrinsic that loads the single-precision values into the SIMD register. &lt;code&gt;svsub_f32_x&lt;/code&gt; and &lt;code&gt;svmla_f32_x&lt;/code&gt; do the subtraction and fused multiply-add, respectively. The main difference from the AVX512 code is that the loop stride is not a hard-coded constant but another intrinsic (&lt;code&gt;svcntw()&lt;/code&gt;) that resolves to the register width of the underlying CPU. Recall that in the AVX512 code, we advance through the loop with &lt;code&gt;n -= 16&lt;/code&gt;. SVE has additional intricacies; for example, every intrinsic call must be predicated/masked. But we will not dive deep into SVE programming here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to the Gravitons: From 256-bit to 128-bit SVE registers
&lt;/h2&gt;

&lt;p&gt;Both Gravitons support NEON and SVE SIMD. In terms of NEON, both microarchitectures have 128-bit SIMD registers. However, in terms of SVE, Graviton4 has 128-bit registers, while Graviton3 has 256-bit registers. &lt;/p&gt;

&lt;p&gt;A smaller SIMD register &lt;em&gt;does not&lt;/em&gt; necessarily mean slower performance. Yes, every instruction call will process fewer values, but performance also depends on the &lt;strong&gt;execution throughput&lt;/strong&gt; and &lt;strong&gt;latencies&lt;/strong&gt; of the instructions used. The &lt;em&gt;execution throughput&lt;/em&gt; is defined in the ARM guides as "the maximum number of instructions per CPU cycle an instruction can achieve," which depends on the CPU design and the ports to which instructions are dispatched. On the other hand, &lt;em&gt;latency&lt;/em&gt; is defined as "the delay (in clock cycles) that the instruction generates in a dependency chain."&lt;/p&gt;

&lt;p&gt;Let’s compare both microarchitectures’ execution throughput and latencies for the instructions relevant to our L2 distance kernel: &lt;code&gt;FMADD&lt;/code&gt; and &lt;code&gt;LOAD&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Execution Throughput
&lt;/h3&gt;

&lt;p&gt;The first number in each cell is the execution throughput. To translate this to effective throughput of &lt;em&gt;data&lt;/em&gt;, we multiply it by the size of the register that executes that instruction.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Microarchitecture&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 3 (Neoverse v1)&lt;/td&gt;
&lt;td&gt;4 x 128 = 512&lt;/td&gt;
&lt;td&gt;2 x 256 = 512&lt;/td&gt;
&lt;td&gt;3 x 128 = 384&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2 x 256 = 512&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 4 (Neoverse v2)&lt;/td&gt;
&lt;td&gt;4 x 128 = 512&lt;/td&gt;
&lt;td&gt;4 x 128 = 512&lt;/td&gt;
&lt;td&gt;3 x 128 = 384&lt;/td&gt;
&lt;td&gt;3 x 128 = 384&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*&lt;em&gt;These numbers are taken from the official microarchitecture guides of &lt;a href="https://developer.arm.com/documentation/109897/0600" rel="noopener noreferrer"&gt;Neoverse v1&lt;/a&gt; and &lt;a href="https://developer.arm.com/documentation/109898/0300/" rel="noopener noreferrer"&gt;Neoverse v2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We see that Graviton4 maintains the execution throughput of floating-point arithmetic despite the smaller register width. From these numbers, only one stands out: the &lt;strong&gt;SVE LOAD&lt;/strong&gt;. Graviton3 can load 33% more data than Graviton4 in one CPU cycle, which is more than the upgrade in clock frequency from Graviton3 to Graviton4 (around 8%). This gives an advantage to Graviton3 if the data is cache resident. In fact, this is reflected in our read memory bandwidth benchmarks that show that when data is L1-resident, the read bandwidth using SVE intrinsics is 26% higher in Graviton3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faterxk136oep8x9laq24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faterxk136oep8x9laq24.png" alt="SVE Read Memory Bandwidth" width="50%"&gt;&lt;/a&gt;&lt;br&gt;SVE Read Memory Bandwidth on Graviton3 (r7g.metal) and Graviton4 (r8g.metal-24xl)
  &lt;/p&gt;

&lt;p&gt;However, in vector search, we are usually in the case where data sits in L3/DRAM (in IVF indexes or full scans) or, at best, in L2 (e.g., in the top layers of HNSW indexes). There, the difference in read bandwidth is small.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Microarchitecture&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 3 (Neoverse v1)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 4 (Neoverse v2)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*&lt;em&gt;These numbers are taken from the official microarchitecture guides of &lt;a href="https://developer.arm.com/documentation/109897/0600" rel="noopener noreferrer"&gt;Neoverse v1&lt;/a&gt; and &lt;a href="https://developer.arm.com/documentation/109898/0300/" rel="noopener noreferrer"&gt;Neoverse v2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The latencies are the same on both CPUs. However, the total latency cost to process the same amount of data is higher on Graviton4 due to the smaller register width. For example, one &lt;code&gt;FMADD&lt;/code&gt; on Graviton4 spends 4 cycles to process 128 bits (4 &lt;code&gt;float32&lt;/code&gt; values), whereas on Graviton3 the same 4 cycles process 256 bits (8 &lt;code&gt;float32&lt;/code&gt; values). &lt;/p&gt;

&lt;p&gt;Also, recall that in our AVX and SVE code of the L2 distance kernel, each iteration of the loop depends on the previous one since all the FMADDs are accumulating distances on the same SIMD register. This creates a dependency chain, making it harder for the CPU to leverage features such as out-of-order execution. Therefore, &lt;strong&gt;the SIMD register width becomes more critical in similarity calculations&lt;/strong&gt;, as instructions may not be able to be executed in parallel up to their maximum throughput. &lt;/p&gt;
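
&lt;p&gt;A standard way to soften such a dependency chain (not what the kernels benchmarked here do; just an illustration of the technique) is to accumulate into several independent accumulators and combine them at the end, so consecutive FMAs no longer wait on each other:&lt;/p&gt;

```c
#include <stddef.h>

/* Illustration of breaking the accumulation dependency chain: four
 * independent accumulators let the CPU overlap the latency of
 * consecutive multiply-adds, since each addition depends only on its
 * own chain. The same idea applies to SIMD registers. */
float l2_squared_unrolled(const float *a, const float *b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float d0 = a[i]     - b[i];
        float d1 = a[i + 1] - b[i + 1];
        float d2 = a[i + 2] - b[i + 2];
        float d3 = a[i + 3] - b[i + 3];
        acc0 += d0 * d0;   /* the four chains are independent */
        acc1 += d1 * d1;
        acc2 += d2 * d2;
        acc3 += d3 * d3;
    }
    for (; i < n; ++i) {   /* scalar tail */
        float d = a[i] - b[i];
        acc0 += d * d;
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

&lt;p&gt;With a single accumulator, each addition must wait out the full instruction latency; with several, the CPU can keep more of them in flight, which is also why a wider register (more work per link in the chain) helps so much here.&lt;/p&gt;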

&lt;p&gt;Of course, it is hard to precisely determine the impact of these extra cycles and less effective throughput on the bigger picture of a vector similarity search query. Each CPU microarchitecture is wildly different and implements different mechanisms to maintain performance despite having a smaller register width. Therefore, let's run some benchmarks!&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: Graviton 3 vs Graviton 4
&lt;/h2&gt;

&lt;p&gt;We compared the queries per second (QPS) and queries per dollar (QP$) given by Graviton3 (&lt;code&gt;r7g.2xlarge&lt;/code&gt;, $0.4284/h in &lt;code&gt;us-east-1&lt;/code&gt;) and Graviton4 (&lt;code&gt;r8g.2xlarge&lt;/code&gt;, $0.4713/h) on a variety of vector search scenarios. The machines have Ubuntu 24 with GCC 13 and LLVM 18. &lt;/p&gt;
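
&lt;p&gt;Queries per dollar is simply throughput normalized by the hourly instance price. Assuming QP$ = QPS × 3600 / hourly price (my formula for illustration, not an official metric), the ~10% price gap alone tilts QP$ toward Graviton3 at equal QPS:&lt;/p&gt;

```c
/* Hypothetical helper: queries per dollar from sustained QPS and the
 * on-demand hourly price, assuming QP$ = QPS * 3600 / price_per_hour. */
double queries_per_dollar(double qps, double price_per_hour) {
    return qps * 3600.0 / price_per_hour;
}
```

&lt;p&gt;At 1000 QPS on both machines, &lt;code&gt;r7g.2xlarge&lt;/code&gt; yields roughly 8.4M queries per dollar versus roughly 7.6M on &lt;code&gt;r8g.2xlarge&lt;/code&gt;.&lt;/p&gt;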

&lt;p&gt;We used 2 datasets contrasting in their dimensionality: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI (D=1536, N=1M)&lt;/li&gt;
&lt;li&gt;SIFT (D=128, N=1M)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these single-threaded benchmarks, we used &lt;a href="https://github.com/facebookresearch/faiss" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; (compiled with SVE) and &lt;a href="https://github.com/unum-cloud/usearch" rel="noopener noreferrer"&gt;USearch&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  IVF Indexes in FAISS (&lt;code&gt;float32&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Graviton3 performs better at all recall levels, and even more so on the dataset of higher dimensionality. Things look even worse for Graviton4 once price is factored in, as Graviton3 is around 10% cheaper. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-ivf_flat_faiss.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-ivf_flat_faiss.png%23center" alt="Graviton4 vs Graviton3 FAISS IVF Flat index"&gt;&lt;/a&gt;&lt;br&gt;FAISS IVF Flat index: Graviton4 vs Graviton3
  &lt;/p&gt;

&lt;p&gt;Notice how the gap closes on the dataset with a lower dimensionality. However, it is nowhere near the 30% performance improvement AWS promises when jumping from Graviton3 to Graviton4.&lt;/p&gt;

&lt;h3&gt;
  
  
  HNSW Indexes in USearch
&lt;/h3&gt;

&lt;p&gt;On &lt;code&gt;float32&lt;/code&gt;, again, Graviton3 performs better at all recall levels. At the highest recall level, Graviton3 delivers 5% more QPS and 15% more QP$ on the OpenAI dataset, and 13% more QPS and 25% more QP$ on the SIFT/128 dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw_flat_usearch.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw_flat_usearch.png%23center" alt="Graviton4 vs Graviton3 USearch HNSW Flat index"&gt;&lt;/a&gt;&lt;br&gt;USearch HNSW Flat index: Graviton4 vs Graviton3
  &lt;/p&gt;

&lt;p&gt;The story is the same for quantized vectors. Here, we show only the performance at the highest possible recall level:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-openai.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-openai.png%23center" alt="On the OpenAI/1536 dataset"&gt;&lt;/a&gt;&lt;br&gt;On the OpenAI/1536 dataset
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-sift.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-sift.png%23center" alt="On the SIFT/128 dataset"&gt;&lt;/a&gt;&lt;br&gt;On the SIFT/128 dataset
  &lt;/p&gt;

&lt;p&gt;In the dataset of smaller dimensionality (SIFT/128), G4 takes the upper hand on QPS, but G3 remains competitive on QP$. Here, the bigger L2 of G4 could be kicking in due to the entry nodes of the HNSW index being cached more efficiently. Also, smaller vectors imply fewer calls to SIMD instructions, which benefits G4. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 1&lt;/strong&gt;: USearch switches to NEON for 1-bit vectors if the vectors are of 128 dimensions, which is the case here. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 2&lt;/strong&gt;: We did not benchmark quantized vectors in FAISS because FAISS does asymmetric distance calculations. This in itself would not be a problem, but for ARM, FAISS does not use SIMD instructions to go from the 8-bit, 6-bit, 4-bit domain to the &lt;code&gt;float32&lt;/code&gt; domain. This leads to poor performance in both architectures (&amp;gt;6x slower than Zen4).&lt;/p&gt;

&lt;h3&gt;
  
  
  Raw Distance Calculations (1-to-many)
&lt;/h3&gt;

&lt;p&gt;We ran a standalone benchmark of L2 distance calculations to eliminate possible artifacts and overhead introduced by vector systems. Here, we used randomly generated &lt;code&gt;float32&lt;/code&gt; collections of different sizes and dimensionalities. The collections range from being small enough to fit in L1 and large enough to spill to DRAM. Our code is as simple as it can get: Put the vectors in memory and do 1-to-many distance calculations with the L2 kernels taken from SimSIMD. Here, we do not do a KNN search; we only do pure distance calculations. We repeat this experiment thousands of times to warm up the caches. &lt;/p&gt;
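
&lt;p&gt;The skeleton of that benchmark looks roughly like this (scalar version for clarity; the real runs plug in the SimSIMD kernels and repeat the scan thousands of times):&lt;/p&gt;

```c
#include <stddef.h>

/* Sketch of the 1-to-many benchmark: one query against a flat,
 * row-major collection, computing squared-L2 to every vector.
 * No KNN heap, no index: pure distance calculations. Repeating the
 * whole scan warms the caches, as in the real runs. */
void scan_collection(const float *query, const float *collection,
                     size_t n_vectors, size_t dims, float *distances) {
    for (size_t v = 0; v < n_vectors; ++v) {
        const float *vec = collection + v * dims;
        float d2 = 0.0f;
        for (size_t d = 0; d < dims; ++d) {
            float diff = query[d] - vec[d];
            d2 += diff * diff;  /* replaced by the SIMD kernel in practice */
        }
        distances[v] = d2;
    }
}
```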

&lt;p&gt;&lt;strong&gt;NEON vs NEON&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When using NEON kernels, &lt;strong&gt;Graviton4 is, on average, 10% faster than Graviton3&lt;/strong&gt; across all settings. This improvement is on par with the increase in clock frequency between both microarchitectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SVE vs SVE&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When switching from NEON to SVE, &lt;strong&gt;Graviton3 saw a 37% performance improvement over its NEON counterpart&lt;/strong&gt;. &lt;a href="https://ashvardanian.com/posts/simsimd-faster-scipy/#tails-of-the-past-the-significance-of-masked-loads" rel="noopener noreferrer"&gt;Ash Vardanian&lt;/a&gt; reported similar findings. Graviton4, however, sees no benefit from SVE (in fact, compiling FAISS with or without SVE yields the same performance); SVE is actually &lt;em&gt;slightly&lt;/em&gt; slower than NEON on G4. This could be benchmark noise or the overhead of masked operations in SVE.&lt;/p&gt;

&lt;p&gt;When comparing G3 SVE vs G4 SVE, these are the results across scenarios:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-purescan.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-purescan.png%23center" alt="Graviton3 is X times faster than Graviton4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graviton3 is, on average, 31% faster than Graviton4.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;You can check the fully disaggregated benchmarks for each collection size and dimensionality in &lt;a href="https://docs.google.com/spreadsheets/d/1kUu96o_vc_-aAEBhA4_-WLjOL1AIRdDzlWQZmrl7Wps/edit?usp=sharing" rel="noopener noreferrer"&gt;this spreadsheet&lt;/a&gt;. A few things stand out: (i) the wider the vectors, the bigger the gap between G3 and G4; (ii) if the vectors fit in the cache, Graviton3 can be almost 2x faster than Graviton4; (iii) only at a dimensionality of 8 do the tables turn, with Graviton3 slightly slower than Graviton4. &lt;/p&gt;

&lt;p&gt;The benchmarks are clear: &lt;strong&gt;if you are doing vector search with a vector library that supports SVE, you should use Graviton 3. It is cheaper, and it will also deliver more raw performance in the majority of scenarios.&lt;/strong&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Why did Graviton4 regress the SVE register width?
&lt;/h2&gt;

&lt;p&gt;While we do not have a proper answer to this question, we can speculate. &lt;/p&gt;

&lt;p&gt;Currently, most code for ARM is written in NEON, where Graviton4 is better than Graviton3. A possibility is that AWS went for a CPU that performed better in existing workloads and benchmarks, which usually use code written in NEON. In fact, &lt;a href="https://lemire.me/blog/2024/07/10/benchmarking-arm-processors-graviton-4-graviton-3-and-apple-m2/" rel="noopener noreferrer"&gt;Daniel Lemire&lt;/a&gt; and &lt;a href="https://chipsandcheese.com/p/arms-neoverse-v2-in-awss-graviton-4" rel="noopener noreferrer"&gt;Chips and Cheese&lt;/a&gt; benchmarks on Graviton4 vs Graviton3 used NEON code. In other words, &lt;strong&gt;SVE is not yet widely used.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another possible reason is that SIMD instructions can be fully pipelined in many workloads. However, similarity calculations are different because there is a dependency chain between multiple calls to SIMD instructions. The latter benefits wider registers, especially on vectors of high dimensionalities. While we haven’t done benchmarks on SVE for other types of workloads, our intuition is that if the workload can be fully pipelined, it would be faster in Graviton4.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Remarks
&lt;/h2&gt;

&lt;p&gt;I have found myself down a (nice) rabbit hole of CPU microarchitecture design and performance. We have actually run the same experiments presented in this blog post on 5 microarchitectures (Intel SPR, Graviton3, Graviton4, Zen3, and Zen4). One of our most interesting findings is that &lt;strong&gt;carefully choosing the microarchitecture for your vector search use case can give you up to 3x more queries per second and queries per dollar&lt;/strong&gt;. This is the case, for instance, with Zen4 in IVF indexes compared to Intel Sapphire Rapids (despite the latter having better specs on paper). I am writing a blog post about that, so stay tuned!&lt;/p&gt;

&lt;p&gt;Regarding the Gravitons, it is nonetheless a weird decision to halve the SIMD register size in the generational jump from Graviton3 to Graviton4. Perhaps in terms of the engineering design of the core, it is hard to keep supporting NEON registers of 128 bits AND SVE registers of double the size. &lt;/p&gt;

&lt;p&gt;Not long ago, Daniel Lemire commented that "&lt;a href="https://x.com/lemire/status/1889150598001422645" rel="noopener noreferrer"&gt;AMD is mopping the floor with ARM in terms of SIMD&lt;/a&gt;." &lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1889150598001422645-33" src="https://platform.twitter.com/embed/Tweet.html?id=1889150598001422645"&gt;
&lt;/iframe&gt;




 &lt;/p&gt;

&lt;p&gt;It is not rocket science: a smaller register will impact the performance of some workloads, even if the execution throughput stays the same. Of course, when comparing different microarchitectures, more things come into play, such as memory access latency, read memory bandwidth, and the data-access patterns of the workload. &lt;strong&gt;Ultimately, &lt;em&gt;your&lt;/em&gt; decision to use a microarchitecture should be based on data-driven benchmarks of your own use case.&lt;/strong&gt; As you may have noticed, performance depends on various factors, such as the search algorithm and the size of the vectors. Perhaps an interesting follow-up post would be to test both architectures under a multi-threaded vector search setting.&lt;/p&gt;

&lt;p&gt;For now, it seems that aside from the portability benefits, there is currently not much payoff in migrating NEON code to SVE, especially if the cores used by AWS will keep the SVE register size on par with NEON. The only exception would be when one needs to use &lt;a href="https://ashvardanian.com/posts/simd-set-intersections-sve2-avx512/" rel="noopener noreferrer"&gt;an intrinsic that is only available in SVE&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Kudos to &lt;a href="https://ashvardanian.com/" rel="noopener noreferrer"&gt;Ash&lt;/a&gt; for giving input on these findings and for putting them forward to the ARM team. We are still awaiting their input.&lt;/p&gt;




&lt;p&gt;Visit &lt;a href="https://lkuffo.com/" rel="noopener noreferrer"&gt;my website&lt;/a&gt; to learn more about me ;-)&lt;/p&gt;

</description>
      <category>programming</category>
      <category>vectordatabase</category>
      <category>cpp</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
