<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonardo Kuffo</title>
    <description>The latest articles on DEV Community by Leonardo Kuffo (@leokuffo).</description>
    <link>https://dev.to/leokuffo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1720364%2F04b14091-62cd-4b56-aee8-d9de10116bc9.jpg</url>
      <title>DEV Community: Leonardo Kuffo</title>
      <link>https://dev.to/leokuffo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leokuffo"/>
    <language>en</language>
    <item>
      <title>Sub-millisecond similarity search on IVF indexes with PDX</title>
      <dc:creator>Leonardo Kuffo</dc:creator>
      <pubDate>Fri, 25 Jul 2025 14:21:35 +0000</pubDate>
      <link>https://dev.to/leokuffo/sub-millisecond-similarity-search-on-ivf-indexes-with-pdx-35n8</link>
      <guid>https://dev.to/leokuffo/sub-millisecond-similarity-search-on-ivf-indexes-with-pdx-35n8</guid>
      <description>&lt;p&gt;In a &lt;a href="https://www.lkuffo.com/vertical-vector-similarity-search-pdx/" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;, I talked about PDX. &lt;a href="https://github.com/cwida/PDX" rel="noopener noreferrer"&gt;PDX&lt;/a&gt; is a data layout that &lt;strong&gt;transposes&lt;/strong&gt; vectors in a column-major order. This layout unleashes the true potential of dimension pruning algorithms for similarity search.&lt;/p&gt;

&lt;p&gt;In this blog post, I discuss how we recently improved this first version of the PDX layout and &lt;strong&gt;achieved sub-millisecond similarity search on millions of vectors using only vanilla IVF indexes on the PDX layout.&lt;/strong&gt; This is remarkable, as vanilla IVF indexes are deemed "slow" by many vector database vendors.&lt;/p&gt;

&lt;p&gt;Of course, you can skip the intro and go directly to the benchmarks ;)&lt;/p&gt;

&lt;h2&gt;Recap: The PDX layout&lt;/h2&gt;

&lt;p&gt;PDX is a transposed layout (a.k.a. columnar, vertical, or decomposed layout), which means that the same dimension of different vectors is stored sequentially. This decomposition occurs within a block (e.g., a cluster in an IVF index).&lt;/p&gt;

&lt;p&gt;The PDX layout unleashes the true potential of pruning algorithms (e.g., &lt;a href="https://github.com/gaoj0017/ADSampling/" rel="noopener noreferrer"&gt;ADSampling&lt;/a&gt;). The idea of pruning is to avoid checking all the dimensions of a vector to determine if it is a neighbor of a query. PDX is the ideal environment for pruning, as &lt;strong&gt;we avoid introducing a control hazard in the core distance calculation kernel&lt;/strong&gt; to ask whether we can prune a vector. You can read more details about this and other benefits of the PDX layout in &lt;a href="https://www.lkuffo.com/vertical-vector-similarity-search-pdx/" rel="noopener noreferrer"&gt;my previous blog post&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;However, when increasing the number of retrieved neighbors (e.g., K = 100), the pruning threshold loosens. As a result, a higher number of candidates is explored in the pruning phase of the search algorithm, resulting in increased random access in the transposed layout. The same problem occurs when few clusters are explored (e.g., 8 clusters), as the number of candidates in the pruning phase is high. &lt;/p&gt;

&lt;h2&gt;An improved PDX layout&lt;/h2&gt;

&lt;p&gt;We have evolved the PDX layout from the one presented in &lt;a href="https://arxiv.org/pdf/2503.04422" rel="noopener noreferrer"&gt;our publication&lt;/a&gt;. This new iteration reduces random access and is less affected by looser pruning thresholds. It is thereby more robust and accelerates a wider variety of settings (e.g., K = 100, fewer clusters probed). Furthermore, we adapted the layout to work with smaller data types such as 8-bit and 1-bit vectors.&lt;/p&gt;

&lt;p&gt;The key observation for these improvements is that pruning algorithms, such as ADSampling, access dimensions &lt;strong&gt;in a sequential order&lt;/strong&gt;. With this in mind, let’s look at the newer versions of PDX.&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;float32&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;For &lt;code&gt;float32&lt;/code&gt;, the first 25% of the dimensions are fully decomposed, just as in the original PDX. We refer to this as the "vertical block." The remaining 75% are decomposed into subvectors of 64 dimensions; we refer to this as the "horizontal block." The vertical block is used for efficient pruning, and the horizontal block is accessed only for the candidates that were not pruned. Since the horizontal block is still decomposed every 64 dimensions, we keep a chance to prune the few remaining candidates at every 64-dimension boundary.&lt;/p&gt;
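
&lt;p&gt;To make the addressing concrete, here is a small sketch of where dimension &lt;code&gt;dim&lt;/code&gt; of vector &lt;code&gt;vec&lt;/code&gt; lands inside a block of &lt;code&gt;n&lt;/code&gt; vectors with &lt;code&gt;d&lt;/code&gt; dimensions. The function name and the exact boundary handling are mine, not taken from the PDX codebase:&lt;/p&gt;

```c
#include <stddef.h>

/* Sketch of the improved float32 PDX layout: the first d/4 dimensions are
 * fully transposed (dimension-major, the "vertical block"); the remaining
 * dimensions are stored in 64-dimension sub-vectors, contiguous per vector
 * within each 64-dimension slice (the "horizontal block").
 * Assumes d is divisible by 4 and (d - d/4) by 64. */
static size_t pdx_f32_offset(size_t vec, size_t dim, size_t n, size_t d) {
    size_t vertical_dims = d / 4;          /* 25% fully decomposed */
    if (dim < vertical_dims)
        return dim * n + vec;              /* dimension-major */
    size_t h_dim  = dim - vertical_dims;   /* position in the horizontal block */
    size_t group  = h_dim / 64;            /* which 64-dimension slice */
    size_t in_sub = h_dim % 64;            /* offset inside the sub-vector */
    return vertical_dims * n               /* skip the vertical block */
         + group * n * 64                  /* skip earlier slices */
         + vec * 64 + in_sub;              /* this vector's sub-vector */
}
```

&lt;p&gt;Note how, at a fixed vertical dimension, consecutive vectors are adjacent (fast pruning scans), while a non-pruned vector's next 64 dimensions are contiguous (fast tail scans).&lt;/p&gt;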

&lt;p&gt;The following image shows this layout. Storage is sequential from left to right, and from top to bottom:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-f32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-f32.png" alt="PDX Layout for 32-bit data" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this layout, random access is significantly reduced during the pruning phase of the search, while still benefiting from the per-query and per-dataset adaptiveness provided by the vertical block. Another trick we incorporated into our algorithm: if the number of remaining candidates is already low before the scan of the vertical block finishes, we jump directly to the horizontal blocks. As a result, random access to individual dimensions happens only for an extremely small number of vectors.&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;8 bits&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Smaller data types are not friendly to the original PDX, as we must accumulate distances in wider types, creating an asymmetry between the data width and the accumulator width. We can work around this by changing the layout. For &lt;code&gt;8 bits&lt;/code&gt;, the vertical block is decomposed every 4 dimensions. This allows us to use dot product instructions (&lt;code&gt;VPDPBUSD&lt;/code&gt; in &lt;a href="https://www.officedaytime.com/simd512e/simdimg/si.php?f=vpdpbusd" rel="noopener noreferrer"&gt;x86&lt;/a&gt; and &lt;code&gt;UDOT/SDOT&lt;/code&gt; in &lt;a href="https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-" rel="noopener noreferrer"&gt;NEON&lt;/a&gt;) to compute L2 or IP kernels while still benefiting from PDX. The horizontal block remains decomposed every 64 dimensions.&lt;/p&gt;
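
&lt;p&gt;To see why grouping the vertical block every 4 dimensions matches the hardware, here is a scalar model of what &lt;code&gt;VPDPBUSD&lt;/code&gt;-style instructions compute in each 32-bit lane (a sketch for illustration; the real kernels use the SIMD intrinsics directly):&lt;/p&gt;

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of VPDPBUSD (x86) or UDOT/SDOT (NEON):
 * four 8-bit products are summed and accumulated into an int32. With the
 * 8-bit vertical block decomposed every 4 dimensions, one lane consumes
 * one 4-dimension group of one vector per instruction. */
static int32_t dot4_u8_i8(int32_t acc, const uint8_t q[4], const int8_t v[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)q[i] * (int32_t)v[i];
    return acc;
}

/* Inner-product kernel over d dimensions (d divisible by 4). */
static int32_t ip_u8_i8(const uint8_t *q, const int8_t *v, int d) {
    int32_t acc = 0;
    for (int i = 0; i < d; i += 4)
        acc = dot4_u8_i8(acc, q + i, v + i);
    return acc;
}
```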

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Flayout-u8.png" alt="PDX Layout for 8-bit data" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;1 bit&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;For Hamming/Jaccard kernels, we can use a layout decomposed every 8 dimensions (naturally grouped into bytes). The population count accumulation can be done in &lt;code&gt;bytes&lt;/code&gt;. If d &amp;gt; 256, we flush the popcounts into a wider type every 32 words. This has not been implemented in our repository yet, but you can find some promising benchmarks &lt;a href="https://github.com/lkuffo/binary-index/blob/main/README_LK.md#column-major-layout" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
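
&lt;p&gt;A minimal sketch of such a kernel (the flush interval of 16 bytes is a conservative choice of mine; any interval below 32 bytes keeps the byte accumulator from overflowing, since each byte contributes at most 8):&lt;/p&gt;

```c
#include <stdint.h>

/* Hamming distance for 1-bit vectors packed into bytes (8 dimensions per
 * byte). Population counts are accumulated in a uint8 and periodically
 * flushed into a uint32 before the byte accumulator can overflow. */
static uint32_t hamming_1bit(const uint8_t *a, const uint8_t *b, int nbytes) {
    uint32_t total = 0;
    uint8_t  acc   = 0;
    for (int i = 0; i < nbytes; i++) {
        acc += (uint8_t)__builtin_popcount(a[i] ^ b[i]);
        if ((i & 15) == 15) { total += acc; acc = 0; }  /* flush to u32 */
    }
    return total + acc;
}
```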

&lt;h2&gt;Sub-millisecond similarity search with PDX and IVF₂&lt;/h2&gt;

&lt;p&gt;Finally, we introduce IVF₂, a two-level IVF index that tackles a bottleneck of IVF indexes: finding the nearest centroids. The idea is simple: we cluster the centroids generated by the IVF index. This produces a new, reduced set of super-centroids, which we scan first to determine which clusters of centroids are the most promising. Then, PDX can quickly scan these super-clusters, which contain the original centroids, without sacrificing recall thanks to pruning. In brief, we simply perform a new k-means clustering on the centroids so that pruning algorithms can also be used when scanning them.&lt;/p&gt;
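
&lt;p&gt;The two-level lookup can be sketched as follows. The brute-force scans and all names here are illustrative only; the real implementation scans the super-clusters with PDX and pruning:&lt;/p&gt;

```c
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional vectors. */
static float l2sq(const float *a, const float *b, int d) {
    float s = 0.f;
    for (int i = 0; i < d; i++) { float t = a[i] - b[i]; s += t * t; }
    return s;
}

/* Index of the nearest row to q among m row-major points. */
static int nearest(const float *q, const float *rows, int m, int d) {
    int best = 0;
    float best_d = l2sq(q, rows, d);
    for (int i = 1; i < m; i++) {
        float dist = l2sq(q, rows + (size_t)i * d, d);
        if (dist < best_d) { best_d = dist; best = i; }
    }
    return best;
}

/* IVF2 sketch: first pick the nearest of s super-centroids, then scan only
 * the original centroids clustered under it. members[g] lists the centroid
 * ids in super-cluster g; member_count[g] is its size. */
static int ivf2_nearest_centroid(const float *q,
                                 const float *super, int s,
                                 const float *centroids,
                                 const int **members,
                                 const int *member_count, int d) {
    int g = nearest(q, super, s, d);             /* level 1 */
    int best = members[g][0];
    float best_d = l2sq(q, centroids + (size_t)best * d, d);
    for (int i = 1; i < member_count[g]; i++) {  /* level 2 */
        int c = members[g][i];
        float dist = l2sq(q, centroids + (size_t)c * d, d);
        if (dist < best_d) { best_d = dist; best = c; }
    }
    return best;
}
```

&lt;p&gt;In practice, one would probe several super-clusters rather than just one to keep recall high.&lt;/p&gt;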

&lt;p&gt;This achieves significant throughput improvements when paired with 8-bit quantization. IVF₂ scans the vectors so quickly that the bottleneck of our algorithm is now the random rotation of the query vector needed by ADSampling. We plan to improve this in a future release.&lt;/p&gt;

&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;We present single-threaded &lt;strong&gt;benchmarks&lt;/strong&gt; against FAISS on &lt;code&gt;r7iz.xlarge&lt;/code&gt; (x86_64, Intel Sapphire Rapids with AVX512) and &lt;code&gt;r8g.xlarge&lt;/code&gt; (ARM, Graviton 4 with NEON/SVE) instances. We used 3 datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI embeddings: N=1M, D=1536&lt;/li&gt;
&lt;li&gt;MxbAI embeddings: N=769K, D=1024&lt;/li&gt;
&lt;li&gt;arXiv embeddings: N=2.25M, D=768&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Two-Level IVF (IVF₂)&lt;/h3&gt;

&lt;p&gt;IVF₂ can achieve &lt;strong&gt;sub-millisecond query latency&lt;/strong&gt; on both x86 and ARM, accelerating SIMD-optimized FAISS by factors ranging from 2x to 13x, depending on the target recall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-intel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-intel.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;p&gt;In the OpenAI dataset at 0.90 recall, the random rotation of the query vector dominates the runtime. Therefore, we could also achieve sub-millisecond performance here by speeding up the random rotation of the query with a faster algorithm.&lt;/p&gt;

&lt;p&gt;A key factor for these speedups is that we do not decode the data to the &lt;code&gt;float32&lt;/code&gt; domain. Instead, we transform the query into the &lt;code&gt;8-bit&lt;/code&gt; domain and perform the distance calculation using 8-bit SIMD.&lt;/p&gt;
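
&lt;p&gt;A sketch of what such a query transformation can look like, using a simple min/max scalar quantizer (the scheme and names are illustrative, not the exact PDX transformation):&lt;/p&gt;

```c
#include <stdint.h>

/* Map a float32 query into the same [0, 255] domain as the quantized data,
 * given the [lo, hi] range used to quantize the dataset. Afterwards,
 * distances can be computed entirely with 8-bit SIMD, with no float32
 * decoding of the stored vectors. */
static void quantize_query_u8(const float *q, uint8_t *out, int d,
                              float lo, float hi) {
    float scale = 255.0f / (hi - lo);
    for (int i = 0; i < d; i++) {
        float x = (q[i] - lo) * scale;
        if (x < 0.0f)   x = 0.0f;      /* clamp out-of-range values */
        if (x > 255.0f) x = 255.0f;
        out[i] = (uint8_t)(x + 0.5f);  /* round to nearest */
    }
}
```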

&lt;p&gt;On ARM, things get uglier for FAISS, as it &lt;a href="https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L190" rel="noopener noreferrer"&gt;lacks SIMD&lt;/a&gt; for &lt;code&gt;8-bit&lt;/code&gt; decoding into &lt;code&gt;float32&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-arm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf2-arm.png" alt="Graviton 4 (ARM with NEON)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Graviton 4 (ARM with NEON)
  &lt;/p&gt;

&lt;h3&gt;Vanilla IVF&lt;/h3&gt;

&lt;p&gt;In vanilla IVF indexes with &lt;code&gt;float32&lt;/code&gt; vectors, PDX demonstrates its superiority, now even with a larger number of neighbors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;p&gt;PDX also maintains its advantage with a smaller number of neighbors (K = 10):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel-k10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-intel-k10.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;h3&gt;Exhaustive search + IVF&lt;/h3&gt;

&lt;p&gt;An exhaustive search scans all the vectors in the collection. Having an IVF index with PDX can accelerate this &lt;strong&gt;dramatically&lt;/strong&gt; without sacrificing recall, thanks to the reliable pruning of ADSampling. &lt;strong&gt;We can accelerate a complete scan of the collection by up to 50x!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-exhaustive-intel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flkuffo.com%2Fsub-ms-pdx%2Fivf-exhaustive-intel.png" alt="Intel Sapphire Rapids (x86_64 with AVX512)" width="800" height="283"&gt;&lt;/a&gt;&lt;br&gt;Intel Sapphire Rapids (x86_64 with AVX512)
  &lt;/p&gt;

&lt;p&gt;The key observation here is that, thanks to the underlying IVF index, the exhaustive search starts with the most promising clusters. A tight threshold is found early on, which enables the quick pruning of most candidates.&lt;/p&gt;

&lt;p&gt;Note that, on these datasets, the exhaustive search with SQ8 (8-bit quantization) achieves &amp;gt; 0.9999 recall.&lt;/p&gt;

&lt;h2&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;We think PDX's performance is remarkable given that, at its core, it is a vanilla IVF index with traditional scalar quantization. This means that &lt;strong&gt;we can achieve sub-millisecond similarity search alongside all the benefits of IVF indexes&lt;/strong&gt;: a low memory footprint and low construction times.&lt;/p&gt;

&lt;p&gt;For now, improving the random rotation algorithm is key to further reducing query latency. The &lt;a href="https://vectordb-ntu.github.io/RaBitQ-Library/" rel="noopener noreferrer"&gt;RaBitQ-Library&lt;/a&gt; already proposes an alternative with &lt;code&gt;O(n log n)&lt;/code&gt; complexity, in contrast to the vanilla rotation, which is &lt;code&gt;O(n^2)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will soon add PDX to the &lt;a href="https://vector-index-bench.github.io/" rel="noopener noreferrer"&gt;VIBE benchmark&lt;/a&gt;. So, stay tuned!&lt;/p&gt;

&lt;p&gt;PDX is an open-source project. You can try it yourself in our &lt;a href="https://github.com/cwida/PDX" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>similaritysearch</category>
      <category>vectordatabase</category>
      <category>programming</category>
      <category>rag</category>
    </item>
    <item>
      <title>AWS Graviton 3 &gt; Graviton 4 for Vector Similarity Search</title>
      <dc:creator>Leonardo Kuffo</dc:creator>
      <pubDate>Sun, 30 Mar 2025 17:48:35 +0000</pubDate>
      <link>https://dev.to/leokuffo/aws-graviton-3-graviton-4-for-vector-similarity-search-23b0</link>
      <guid>https://dev.to/leokuffo/aws-graviton-3-graviton-4-for-vector-similarity-search-23b0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you are doing vector search with a vector library that supports SVE, you should use a Graviton 3 machine. It is cheaper, and it will also deliver more raw performance.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few months ago, we started working on a vertical layout for vector similarity search (&lt;a href="https://github.com/cwida/PDX" rel="noopener noreferrer"&gt;PDX&lt;/a&gt;). As part of the benchmarks that we were running on different microarchitectures and vector systems like FAISS, Milvus, and Usearch, there was an observation that puzzled us: &lt;strong&gt;Graviton3 performed better than Graviton4 in almost all vector search scenarios&lt;/strong&gt;, not only in queries per dollar (QP$) but also in queries per second (QPS). This was the case across vector libraries and even in our implementations of vector search algorithms. Here is one example of the QPS and QP$ of both microarchitectures on queries to an &lt;a href="https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes" rel="noopener noreferrer"&gt;IVF index&lt;/a&gt;: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ien7ccax97rn6ajh387.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ien7ccax97rn6ajh387.png" alt="Recall@10: 0.99 on IVF indexes"&gt;&lt;/a&gt;&lt;br&gt;QPS and QP$ on IVF indexes (&lt;code&gt;float32&lt;/code&gt;) using FAISS+SVE. QP$ are in the order of 10^4.
  &lt;/p&gt;

&lt;p&gt;In the OpenAI/1536 dataset with 1M vectors, Graviton3 delivers 25% more queries per dollar than Graviton4! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s be clear:&lt;/strong&gt; Graviton4 is a better machine than Graviton3. It has a higher clock frequency, a 2x bigger L2 cache, a slightly bigger L3, lower memory-access latency, and a much more capable CPU (upgrading from Neoverse V1 to Neoverse V2). This is shown not only by AWS but also by numerous independent benchmarks, such as those by &lt;a href="https://lemire.me/blog/2024/07/10/benchmarking-arm-processors-graviton-4-graviton-3-and-apple-m2/" rel="noopener noreferrer"&gt;Daniel Lemire&lt;/a&gt;, &lt;a href="https://www.phoronix.com/review/aws-graviton4-benchmarks" rel="noopener noreferrer"&gt;Phoronix&lt;/a&gt;, and &lt;a href="https://chipsandcheese.com/p/arms-neoverse-v2-in-awss-graviton-4" rel="noopener noreferrer"&gt;Chips and Cheese&lt;/a&gt;. But then, &lt;em&gt;why would Graviton3 be better than Graviton4 on vector similarity search?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The main culprit&lt;/strong&gt; is that Graviton4 has a SVE SIMD register size of 128 bits &lt;strong&gt;&lt;em&gt;—half of the 256-bit registers of Graviton3&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the rest of this blog post, we will dive deep into &lt;em&gt;why&lt;/em&gt; this difference is particularly detrimental to the performance of vector similarity search and &lt;em&gt;why&lt;/em&gt; it hasn’t been picked up by other benchmarks. But before discussing the Gravitons, let's refresh our knowledge of SIMD.&lt;/p&gt;

&lt;h2&gt;SIMD in Vector Similarity Search&lt;/h2&gt;

&lt;p&gt;Distance calculations in vector similarity search can be optimized using Single-Instruction-Multiple-Data (SIMD) instructions available in the CPU. These special instructions can process &lt;strong&gt;multiple values in parallel with a single CPU instruction&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;x86_64 architectures&lt;/strong&gt;, the main SIMD instruction sets are the Advanced Vector Extensions (AVX). The number of values AVX can process at a time depends on the SIMD register width supported by the CPU. Modern CPU microarchitectures, like Zen4 and Intel Sapphire Rapids, have 512-bit SIMD registers (AVX512), which can process 16 &lt;code&gt;float32&lt;/code&gt; values with one CPU instruction.&lt;/p&gt;

&lt;p&gt;Let’s look at the following C++ code (taken from the &lt;a href="https://github.com/ashvardanian/SimSIMD/blob/6951b006d8c27b89c91b78019c7af3714b7114f5/include/simsimd/spatial.h#L1520" rel="noopener noreferrer"&gt;SimSIMD&lt;/a&gt; codebase) that uses AVX512 SIMD to calculate the Euclidean distance (L2) between two vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;l2sq_f32_avx512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_setzero&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;a_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nl"&gt;l2sq_f32_loop:&lt;/span&gt;
    &lt;span class="n"&gt;a_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;b_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;d_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_sub_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_fmadd_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;l2sq_f32_loop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reduce_f32x16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical part is the &lt;code&gt;l2sq_f32_loop&lt;/code&gt; that loops through the vector dimensions. In each iteration, &lt;code&gt;_mm512_loadu_ps&lt;/code&gt; loads 16 packed single-precision values (32 bits each) into a SIMD register. We do this twice, once for vector &lt;code&gt;a&lt;/code&gt; and once for vector &lt;code&gt;b&lt;/code&gt;. Then, we do the L2 calculation with a subtraction (&lt;code&gt;_mm512_sub_ps&lt;/code&gt;) and a fused multiply-add (&lt;code&gt;_mm512_fmadd_ps&lt;/code&gt;) that accumulates the distances into a result register (&lt;code&gt;d2_vec&lt;/code&gt;). We keep repeating the loop until we have inspected all the vector dimensions (when &lt;code&gt;n == 0&lt;/code&gt;). Finally, we sum all the values in the SIMD register to get the total distance (&lt;code&gt;reduce_f32x16&lt;/code&gt;); we will skip the explanation of this last step.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;strong&gt;ARM architectures&lt;/strong&gt; also provide SIMD instructions through NEON and SVE. NEON was introduced first, supporting SIMD over 128-bit registers (fitting 4 &lt;code&gt;float32&lt;/code&gt; values at a time). SVE was introduced later. Unlike AVX and NEON, SVE supports variable-size SIMD registers in its intrinsics through VLA (Vector Length Agnostic) programming. This alleviates technical debt, as distance kernels no longer need hardware-dependent loop lengths.&lt;/p&gt;

&lt;p&gt;Let’s take a look now at a code with SVE intrinsics (also taken from &lt;a href="https://github.com/ashvardanian/SimSIMD/blob/6951b006d8c27b89c91b78019c7af3714b7114f5/include/simsimd/spatial.h#L799" rel="noopener noreferrer"&gt;SimSIMD&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;l2sq_f32_sve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svdupq_n_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;svbool_t&lt;/span&gt; &lt;span class="n"&gt;pg_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svwhilelt_b32&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;a_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svld1_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svld1_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;a_minus_b_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svsub_f32_x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;d2_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svmla_f32_x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_minus_b_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_minus_b_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;svcntw&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svaddv_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svptrue_b32&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;d2_vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;svld1_f32&lt;/code&gt; is the SIMD intrinsic that loads the single-precision values into the SIMD register. &lt;code&gt;svsub_f32_x&lt;/code&gt; and &lt;code&gt;svmla_f32_x&lt;/code&gt; do the subtraction and fused multiply-add, respectively. The main difference from the AVX512 code is that the loop stride is not a hard-coded constant but another intrinsic (&lt;code&gt;svcntw()&lt;/code&gt;) that resolves to the register width of the underlying CPU. Recall that in the AVX512 code, we advance through the loop with &lt;code&gt;n -= 16&lt;/code&gt;. SVE has additional intricacies; for example, every intrinsic call must be predicated/masked. But we will not dive deep into SVE programming here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to the Gravitons: From 256-bit to 128-bit SVE registers
&lt;/h2&gt;

&lt;p&gt;Both Gravitons support NEON and SVE SIMD. In terms of NEON, both microarchitectures have 128-bit SIMD registers. However, in terms of SVE, Graviton4 has 128-bit registers, while Graviton3 has 256-bit registers. &lt;/p&gt;

&lt;p&gt;A smaller SIMD register &lt;em&gt;does not&lt;/em&gt; necessarily mean slower performance. Yes, every instruction call will process fewer values, but performance also depends on the &lt;strong&gt;execution throughput&lt;/strong&gt; and &lt;strong&gt;latencies&lt;/strong&gt; of the instructions used. The &lt;em&gt;execution throughput&lt;/em&gt; is defined in the ARM guides as "the maximum number of instructions per CPU cycle an instruction can achieve," which depends on the CPU design and the ports to which instructions are dispatched. On the other hand, &lt;em&gt;latency&lt;/em&gt; is defined as "the delay (in clock cycles) that the instruction generates in a dependency chain."&lt;/p&gt;

&lt;p&gt;Let’s compare both microarchitectures’ execution throughput and latencies for the instructions relevant to our L2 distance kernel: &lt;code&gt;FMADD&lt;/code&gt; and &lt;code&gt;LOAD&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Execution Throughput
&lt;/h3&gt;

&lt;p&gt;The first number in each cell is the execution throughput. To translate this to effective throughput of &lt;em&gt;data&lt;/em&gt;, we multiply it by the size of the register that executes that instruction.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Microarchitecture&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 3 (Neoverse v1)&lt;/td&gt;
&lt;td&gt;4 x 128 = 512&lt;/td&gt;
&lt;td&gt;2 x 256 = 512&lt;/td&gt;
&lt;td&gt;3 x 128 = 384&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2 x 256 = 512&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 4 (Neoverse v2)&lt;/td&gt;
&lt;td&gt;4 x 128 = 512&lt;/td&gt;
&lt;td&gt;4 x 128 = 512&lt;/td&gt;
&lt;td&gt;3 x 128 = 384&lt;/td&gt;
&lt;td&gt;3 x 128 = 384&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*&lt;em&gt;These numbers are taken from the official microarchitecture guides of &lt;a href="https://developer.arm.com/documentation/109897/0600" rel="noopener noreferrer"&gt;Neoverse v1&lt;/a&gt; and &lt;a href="https://developer.arm.com/documentation/109898/0300/" rel="noopener noreferrer"&gt;Neoverse v2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We see that Graviton4 maintains the execution throughput of floating-point arithmetic despite the smaller register width. From these numbers, only one stands out: the &lt;strong&gt;SVE LOAD&lt;/strong&gt;. Graviton3 can load 33% more data than Graviton4 in one CPU cycle, which is more than the upgrade in clock frequency from Graviton3 to Graviton4 (around 8%). This gives an advantage to Graviton3 if the data is cache resident. In fact, this is reflected in our read memory bandwidth benchmarks that show that when data is L1-resident, the read bandwidth using SVE intrinsics is 26% higher in Graviton3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faterxk136oep8x9laq24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faterxk136oep8x9laq24.png" alt="SVE Read Memory Bandwidth" width="50%"&gt;&lt;/a&gt;&lt;br&gt;SVE Read Memory Bandwidth on Graviton3 (r7g.metal) and Graviton4 (r8g.metal-24xl)
  &lt;/p&gt;

&lt;p&gt;However, in vector search, we are usually in the case where data sits in L3/DRAM (in IVF indexes or full scans) or, at best, in L2 (e.g., in the top layers of HNSW indexes). There, the difference in read bandwidth is small.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Microarchitecture&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;FMADD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;NEON &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;SVE &lt;code&gt;LOAD&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 3 (Neoverse v1)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graviton 4 (Neoverse v2)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*&lt;em&gt;These numbers are taken from the official microarchitecture guides of &lt;a href="https://developer.arm.com/documentation/109897/0600" rel="noopener noreferrer"&gt;Neoverse v1&lt;/a&gt; and &lt;a href="https://developer.arm.com/documentation/109898/0300/" rel="noopener noreferrer"&gt;Neoverse v2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The latencies are the same on both CPUs. However, the total latency cost to process the same amount of data is higher on Graviton4 due to the smaller register width. For example, one &lt;code&gt;FMADD&lt;/code&gt; on Graviton4 spends 4 cycles to process 128 bits (4 &lt;code&gt;float32&lt;/code&gt; values), whereas on Graviton3 the same 4 cycles process 256 bits (8 &lt;code&gt;float32&lt;/code&gt; values). &lt;/p&gt;

&lt;p&gt;Also, recall that in our AVX and SVE code of the L2 distance kernel, each iteration of the loop depends on the previous one since all the FMADDs are accumulating distances on the same SIMD register. This creates a dependency chain, making it harder for the CPU to leverage features such as out-of-order execution. Therefore, &lt;strong&gt;the SIMD register width becomes more critical in similarity calculations&lt;/strong&gt;, as instructions may not be able to be executed in parallel up to their maximum throughput. &lt;/p&gt;
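
&lt;p&gt;A standard way to soften such a dependency chain (not what the kernels benchmarked here do; just an illustration of the technique) is to accumulate into several independent accumulators and combine them at the end, so consecutive FMAs no longer wait on each other:&lt;/p&gt;

```c
#include <stddef.h>

/* Illustration of breaking the accumulation dependency chain: four
 * independent accumulators let the CPU overlap the latency of
 * consecutive multiply-adds, since each addition depends only on its
 * own chain. The same idea applies to SIMD registers. */
float l2_squared_unrolled(const float *a, const float *b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float d0 = a[i]     - b[i];
        float d1 = a[i + 1] - b[i + 1];
        float d2 = a[i + 2] - b[i + 2];
        float d3 = a[i + 3] - b[i + 3];
        acc0 += d0 * d0;   /* the four chains are independent */
        acc1 += d1 * d1;
        acc2 += d2 * d2;
        acc3 += d3 * d3;
    }
    for (; i < n; ++i) {   /* scalar tail */
        float d = a[i] - b[i];
        acc0 += d * d;
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

&lt;p&gt;With a single accumulator, each addition must wait out the full instruction latency; with several, the CPU can keep more of them in flight, which is also why a wider register (more work per link in the chain) helps so much here.&lt;/p&gt;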

&lt;p&gt;Of course, it is hard to precisely determine the impact of these extra cycles and less effective throughput on the bigger picture of a vector similarity search query. Each CPU microarchitecture is wildly different and implements different mechanisms to maintain performance despite having a smaller register width. Therefore, let's run some benchmarks!&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: Graviton 3 vs Graviton 4
&lt;/h2&gt;

&lt;p&gt;We compared the queries per second (QPS) and queries per dollar (QP$) given by Graviton3 (&lt;code&gt;r7g.2xlarge&lt;/code&gt;, $0.4284/h in &lt;code&gt;us-east-1&lt;/code&gt;) and Graviton4 (&lt;code&gt;r8g.2xlarge&lt;/code&gt;, $0.4713/h) on a variety of vector search scenarios. The machines have Ubuntu 24 with GCC 13 and LLVM 18. &lt;/p&gt;
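
&lt;p&gt;Queries per dollar is simply throughput normalized by the hourly instance price. Assuming QP$ = QPS × 3600 / hourly price (my formula for illustration, not an official metric), the ~10% price gap alone tilts QP$ toward Graviton3 at equal QPS:&lt;/p&gt;

```c
/* Hypothetical helper: queries per dollar from sustained QPS and the
 * on-demand hourly price, assuming QP$ = QPS * 3600 / price_per_hour. */
double queries_per_dollar(double qps, double price_per_hour) {
    return qps * 3600.0 / price_per_hour;
}
```

&lt;p&gt;At 1000 QPS on both machines, &lt;code&gt;r7g.2xlarge&lt;/code&gt; yields roughly 8.4M queries per dollar versus roughly 7.6M on &lt;code&gt;r8g.2xlarge&lt;/code&gt;.&lt;/p&gt;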

&lt;p&gt;We used 2 datasets contrasting in their dimensionality: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI (D=1536, N=1M)&lt;/li&gt;
&lt;li&gt;SIFT (D=128, N=1M)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these single-threaded benchmarks, we used &lt;a href="https://github.com/facebookresearch/faiss" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; (compiled with SVE) and &lt;a href="https://github.com/unum-cloud/usearch" rel="noopener noreferrer"&gt;USearch&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  IVF Indexes in FAISS (&lt;code&gt;float32&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Graviton3 performs better at all recall levels, and even more so on the dataset of higher dimensionality. Things look even worse for Graviton4 once price is factored in, as Graviton3 is around 10% cheaper. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-ivf_flat_faiss.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-ivf_flat_faiss.png%23center" alt="Graviton4 vs Graviton3 FAISS IVF Flat index"&gt;&lt;/a&gt;&lt;br&gt;FAISS IVF Flat index: Graviton4 vs Graviton3
  &lt;/p&gt;

&lt;p&gt;Notice how the gap closes on the dataset with a lower dimensionality. However, it is nowhere near the 30% performance improvement AWS promises when jumping from Graviton3 to Graviton4.&lt;/p&gt;

&lt;h3&gt;
  
  
  HNSW Indexes in USearch
&lt;/h3&gt;

&lt;p&gt;On &lt;code&gt;float32&lt;/code&gt;, again, Graviton3 performs better at all recall levels. At the highest recall level, Graviton3 delivers 5% more QPS and 15% more QP$ on the OpenAI dataset, and 13% more QPS and 25% more QP$ on the SIFT/128 dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw_flat_usearch.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw_flat_usearch.png%23center" alt="Graviton4 vs Graviton3 USearch HNSW Flat index"&gt;&lt;/a&gt;&lt;br&gt;USearch HNSW Flat index: Graviton4 vs Graviton3
  &lt;/p&gt;

&lt;p&gt;The story is the same for quantized vectors. Here, we show only the performance at the highest possible recall level:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-openai.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-openai.png%23center" alt="On the OpenAI/1536 dataset"&gt;&lt;/a&gt;&lt;br&gt;On the OpenAI/1536 dataset
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-sift.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-hnsw-sift.png%23center" alt="On the SIFT/128 dataset"&gt;&lt;/a&gt;&lt;br&gt;On the SIFT/128 dataset
  &lt;/p&gt;

&lt;p&gt;In the dataset of smaller dimensionality (SIFT/128), G4 takes the upper hand on QPS, but G3 remains competitive on QP$. Here, the bigger L2 of G4 could be kicking in due to the entry nodes of the HNSW index being cached more efficiently. Also, smaller vectors imply fewer calls to SIMD instructions, which benefits G4. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 1&lt;/strong&gt;: USearch switches to NEON for 1-bit vectors if the vectors are of 128 dimensions, which is the case here. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 2&lt;/strong&gt;: We did not benchmark quantized vectors in FAISS because FAISS does asymmetric distance calculations. This in itself would not be a problem, but for ARM, FAISS does not use SIMD instructions to go from the 8-bit, 6-bit, 4-bit domain to the &lt;code&gt;float32&lt;/code&gt; domain. This leads to poor performance in both architectures (&amp;gt;6x slower than Zen4).&lt;/p&gt;

&lt;h3&gt;
  
  
  Raw Distance Calculations (1-to-many)
&lt;/h3&gt;

&lt;p&gt;We ran a standalone benchmark of L2 distance calculations to eliminate possible artifacts and overhead introduced by vector systems. Here, we used randomly generated &lt;code&gt;float32&lt;/code&gt; collections of different sizes and dimensionalities. The collections range from being small enough to fit in L1 and large enough to spill to DRAM. Our code is as simple as it can get: Put the vectors in memory and do 1-to-many distance calculations with the L2 kernels taken from SimSIMD. Here, we do not do a KNN search; we only do pure distance calculations. We repeat this experiment thousands of times to warm up the caches. &lt;/p&gt;
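
&lt;p&gt;The skeleton of that benchmark looks roughly like this (scalar version for clarity; the real runs plug in the SimSIMD kernels and repeat the scan thousands of times):&lt;/p&gt;

```c
#include <stddef.h>

/* Sketch of the 1-to-many benchmark: one query against a flat,
 * row-major collection, computing squared-L2 to every vector.
 * No KNN heap, no index: pure distance calculations. Repeating the
 * whole scan warms the caches, as in the real runs. */
void scan_collection(const float *query, const float *collection,
                     size_t n_vectors, size_t dims, float *distances) {
    for (size_t v = 0; v < n_vectors; ++v) {
        const float *vec = collection + v * dims;
        float d2 = 0.0f;
        for (size_t d = 0; d < dims; ++d) {
            float diff = query[d] - vec[d];
            d2 += diff * diff;  /* replaced by the SIMD kernel in practice */
        }
        distances[v] = d2;
    }
}
```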

&lt;p&gt;&lt;strong&gt;NEON vs NEON&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When using NEON kernels, &lt;strong&gt;Graviton4 is, on average, 10% faster than Graviton3&lt;/strong&gt; across all settings. This improvement is on par with the increase in clock frequency between both microarchitectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SVE vs SVE&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When switching from NEON to SVE, &lt;strong&gt;Graviton3 saw a 37% performance improvement over its NEON counterpart&lt;/strong&gt;. &lt;a href="https://ashvardanian.com/posts/simsimd-faster-scipy/#tails-of-the-past-the-significance-of-masked-loads" rel="noopener noreferrer"&gt;Ash Vardanian&lt;/a&gt; reported similar findings. Graviton4, however, sees no benefit from SVE (in fact, compiling FAISS with or without SVE yields the same performance); SVE is actually &lt;em&gt;slightly&lt;/em&gt; slower than NEON on G4. This could be benchmark noise or the overhead of masked operations in SVE.&lt;/p&gt;

&lt;p&gt;When comparing G3 SVE vs G4 SVE, these are the results across scenarios:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-purescan.png%23center" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.lkuffo.com%2Fg3-vs-g4%2Fgravitons-purescan.png%23center" alt="Graviton3 is X times faster than Graviton4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graviton3 is, on average, 31% faster than Graviton4.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;You can check the fully disaggregated benchmarks for each collection size and dimensionality in &lt;a href="https://docs.google.com/spreadsheets/d/1kUu96o_vc_-aAEBhA4_-WLjOL1AIRdDzlWQZmrl7Wps/edit?usp=sharing" rel="noopener noreferrer"&gt;this spreadsheet&lt;/a&gt;. A few things stand out: (i) the wider the vectors, the bigger the gap between G3 and G4; (ii) if the vectors fit in the cache, Graviton3 can be almost 2x faster than Graviton4; (iii) only at a dimensionality of 8 do the tables turn, with Graviton3 slightly slower than Graviton4. &lt;/p&gt;

&lt;p&gt;The benchmarks are clear: &lt;strong&gt;if you are doing vector search with a vector library that supports SVE, you should use Graviton 3. It is cheaper, and it will also deliver more raw performance in the majority of scenarios.&lt;/strong&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Why did Graviton4 regress the SVE register width?
&lt;/h2&gt;

&lt;p&gt;While we do not have a proper answer to this question, we can speculate. &lt;/p&gt;

&lt;p&gt;Currently, most code for ARM is written in NEON, where Graviton4 is better than Graviton3. A possibility is that AWS went for a CPU that performed better in existing workloads and benchmarks, which usually use code written in NEON. In fact, &lt;a href="https://lemire.me/blog/2024/07/10/benchmarking-arm-processors-graviton-4-graviton-3-and-apple-m2/" rel="noopener noreferrer"&gt;Daniel Lemire&lt;/a&gt; and &lt;a href="https://chipsandcheese.com/p/arms-neoverse-v2-in-awss-graviton-4" rel="noopener noreferrer"&gt;Chips and Cheese&lt;/a&gt; benchmarks on Graviton4 vs Graviton3 used NEON code. In other words, &lt;strong&gt;SVE is not yet widely used.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another possible reason is that SIMD instructions can be fully pipelined in many workloads. However, similarity calculations are different because there is a dependency chain between multiple calls to SIMD instructions. The latter benefits wider registers, especially on vectors of high dimensionalities. While we haven’t done benchmarks on SVE for other types of workloads, our intuition is that if the workload can be fully pipelined, it would be faster in Graviton4.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Remarks
&lt;/h2&gt;

&lt;p&gt;I have found myself down a (nice) rabbit hole of CPU microarchitecture design and performance. We have actually run the same experiments presented in this blog post on 5 microarchitectures (Intel SPR, Graviton3, Graviton4, Zen3, and Zen4). One of our most interesting findings is that &lt;strong&gt;carefully choosing the microarchitecture for your vector search use case can give you up to 3x more queries per second and queries per dollar&lt;/strong&gt;. This is the case, for instance, with Zen4 in IVF indexes compared to Intel Sapphire Rapids (despite the latter having better specs on paper). I am writing a blog post about that, so stay tuned!&lt;/p&gt;

&lt;p&gt;Regarding the Gravitons, it is nonetheless a weird decision to halve the SIMD register size in the generational jump from Graviton3 to Graviton4. Perhaps in terms of the engineering design of the core, it is hard to keep supporting NEON registers of 128 bits AND SVE registers of double the size. &lt;/p&gt;

&lt;p&gt;Not long ago, Daniel Lemire commented that "&lt;a href="https://x.com/lemire/status/1889150598001422645" rel="noopener noreferrer"&gt;AMD is mopping the floor with ARM in terms of SIMD&lt;/a&gt;." &lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1889150598001422645-33" src="https://platform.twitter.com/embed/Tweet.html?id=1889150598001422645"&gt;
&lt;/iframe&gt;




 &lt;/p&gt;

&lt;p&gt;It is not rocket science: a smaller register will impact the performance of some workloads, even if the execution throughput stays the same. Of course, when comparing different microarchitectures, more things come into play, such as memory access latency, read memory bandwidth, and the data-access patterns of the workload. &lt;strong&gt;Ultimately, &lt;em&gt;your&lt;/em&gt; decision to use a microarchitecture should be based on data-driven benchmarks of your own use case.&lt;/strong&gt; As you may have noticed, performance depends on various factors, such as the search algorithm and the size of the vectors. Perhaps an interesting follow-up post would be to test both architectures under a multi-threaded vector search setting.&lt;/p&gt;

&lt;p&gt;For now, it seems that aside from the portability benefits, there is currently not much payoff in migrating NEON code to SVE, especially if the cores used by AWS will keep the SVE register size on par with NEON. The only exception would be when one needs to use &lt;a href="https://ashvardanian.com/posts/simd-set-intersections-sve2-avx512/" rel="noopener noreferrer"&gt;an intrinsic that is only available in SVE&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Kudos to &lt;a href="https://ashvardanian.com/" rel="noopener noreferrer"&gt;Ash&lt;/a&gt; for giving input on these findings and for putting them forward to the ARM team. We are still awaiting their input.&lt;/p&gt;




&lt;p&gt;Visit &lt;a href="https://lkuffo.com/" rel="noopener noreferrer"&gt;my website&lt;/a&gt; to learn more about me ;-)&lt;/p&gt;

</description>
      <category>programming</category>
      <category>vectordatabase</category>
      <category>cpp</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
