<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jonathan Ellis</title>
    <description>The latest articles on DEV Community by Jonathan Ellis (@jbellis).</description>
    <link>https://dev.to/jbellis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F654630%2Fcf554021-f1d4-4cb8-81f8-6fe94752590b.jpg</url>
      <title>DEV Community: Jonathan Ellis</title>
      <link>https://dev.to/jbellis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jbellis"/>
    <language>en</language>
    <item>
      <title>The Best Embedding Models for Information Retrieval in 2025</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Thu, 09 Jan 2025 17:59:20 +0000</pubDate>
      <link>https://dev.to/datastax/the-best-embedding-models-for-information-retrieval-in-2025-3dp5</link>
      <guid>https://dev.to/datastax/the-best-embedding-models-for-information-retrieval-in-2025-3dp5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzq6m7hak2767u8btw2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzq6m7hak2767u8btw2z.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The just-released Voyage-3-large is the surprise leader in embedding relevance&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the exception of OpenAI (whose text-embedding-3 models from January 2024 are ancient in light of the pace of AI progress), all the prominent commercial vector embedding vendors released a new version of their flagship models in late 2024 or early 2025.&lt;/p&gt;

&lt;p&gt;Here’s how the latest and greatest proprietary and open-source models stack up against each other in &lt;a href="https://www.datastax.com/products/datastax-astra?utm_medium=byline&amp;amp;utm_source=devto&amp;amp;utm_campaign=embedding-models-2025&amp;amp;utm_content=" rel="noopener noreferrer"&gt;DataStax Astra DB vector search&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models tested
&lt;/h2&gt;

&lt;p&gt;I tested eight commercial models from three categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Gemini and OpenAI models are still the default option for most people. (Google has released text-embedding-005, but only on its enterprise Vertex platform so far. I tested the older text-embedding-004 that’s available via the Gemini API.)&lt;/li&gt;
&lt;li&gt;Jina, Cohere, and Voyage are the third-party vendors enjoying the most success with embeddings models designed for retrieval.&lt;/li&gt;
&lt;li&gt;NVIDIA is of course the 800 lb. gorilla in its home market, looking to &lt;a href="https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/" rel="noopener noreferrer"&gt;commoditize its complements&lt;/a&gt; by providing high-quality models licensed to run on NVIDIA hardware. They have previously offered fine-tunes of the e5 embeddings model; the llama-based model evaluated here is the first of a new generation for them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also tested three open models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/dunzhang/stella_en_1.5B_v5" rel="noopener noreferrer"&gt;Stella&lt;/a&gt; is the top-performing model on the &lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB retrieval leaderboard&lt;/a&gt; that allows commercial use, so I tested both the 400m and 1.5b variants. (The bte-en-icl model does slightly better, but that’s designed for few-shot use rather than zero-shot, so it’s a different paradigm than everything else here.)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/nomic-ai/modernbert-embed-base" rel="noopener noreferrer"&gt;ModernBERT&lt;/a&gt; Embed is a brand-new model based on the &lt;a href="https://huggingface.co/blog/modernbert" rel="noopener noreferrer"&gt;ModernBERT model&lt;/a&gt; from Answer.AI and LightOn AI. ModernBERT aims to improve on the foundational BERT model in both speed and accuracy, enabling models like Nomic’s Embed to inherit the same advantages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsszw2fkzzj3ml08i9guv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsszw2fkzzj3ml08i9guv.png" alt="Image description" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Models marked with * are trained with &lt;a href="https://huggingface.co/blog/matryoshka" rel="noopener noreferrer"&gt;Matryoshka&lt;/a&gt; techniques, meaning they are designed to keep the most important information in the first dimensions of the output vector, allowing the vector to be truncated while preserving most of the semantic information. I only evaluated the largest, most accurate sizes for these models.&lt;/em&gt;&lt;/p&gt;
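To make the Matryoshka truncation concrete, here is a minimal sketch of consuming such an embedding: keep a prefix of the dimensions and re-normalize. The array and sizes are illustrative, not any vendor's API.

```python
import numpy as np

def truncate_matryoshka(embedding, target_dim):
    """Keep the first target_dim dimensions and re-normalize to unit length."""
    truncated = embedding[:target_dim]
    return truncated / np.linalg.norm(truncated)

# illustrative 1024-d unit vector truncated to 256 dims
rng = np.random.default_rng(42)
full = rng.standard_normal(1024)
full /= np.linalg.norm(full)
short = truncate_matryoshka(full, 256)
```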

&lt;h2&gt;
  
  
  Datasets
&lt;/h2&gt;

&lt;p&gt;These are test sets from the &lt;a href="https://github.com/illuin-tech/vidore-benchmark" rel="noopener noreferrer"&gt;ViDoRe image search benchmark&lt;/a&gt;, OCR’d using Gemini Flash 1.5. Details on the datasets can be found in section 3.1 of the &lt;a href="https://arxiv.org/abs/2407.01449" rel="noopener noreferrer"&gt;ColPali paper&lt;/a&gt;. Notably, TabFQuAD and Shift Project sources are in French; the rest are in English.&lt;/p&gt;

&lt;p&gt;I picked these because most if not all of the classic text-search datasets are being trained on by model developers for whom it’s more important to get to the top of the MTEB leaderboard than to build something actually useful. By OCRing data from image search datasets, I believe I was able to give these models data that they haven’t seen before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;p&gt;Of course, cost is also a concern when evaluating models. Here’s a way to visualize cost versus performance for these models.&lt;/p&gt;

&lt;p&gt;I estimated costs for NVIDIA llama v1, ModernBERT Embed, and the Stella models by multiplying their parameter counts by Jina v3’s price/parameter count, since Jina is the only proprietary model for which I have both hosted pricing and parameter counts available.&lt;/p&gt;
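The estimation rule above is simply linear scaling of price by parameter count. A toy sketch, with deliberately made-up placeholder numbers rather than real vendor pricing:

```python
# Hypothetical illustration of the scaling described above: the price and
# parameter counts below are placeholders, not actual vendor figures.
jina_price_per_mtok = 0.02      # dollars per million tokens (placeholder)
jina_params = 570e6             # parameter count (placeholder)

price_per_param = jina_price_per_mtok / jina_params

def estimated_price(params):
    # scale the reference model's price linearly by parameter count
    return params * price_per_param

stella_400m_price = estimated_price(400e6)
```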

&lt;h2&gt;
  
  
  Observations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;ModernBERT Embed and Gemini text-embedding-004 are trained on English only, so their results are not included for the French datasets. The other models are all multilingual. (The Stella models contain “en” in their full name, but they do just fine on the French datasets, so I left them in.)&lt;/li&gt;
&lt;li&gt;Voyage-3-large is in a league of its own. None of the others consistently comes close. After also &lt;a href="https://www.datastax.com/blog/reranker-algorithm-showdown-vector-search?utm_medium=byline&amp;amp;utm_source=devto&amp;amp;utm_campaign=embedding-models-2025&amp;amp;utm_content=" rel="noopener noreferrer"&gt;sweeping the reranker results&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/tengyuma/" rel="noopener noreferrer"&gt;Tengyu Ma&lt;/a&gt; and his team are doing phenomenal work.&lt;/li&gt;
&lt;li&gt;There seems to be a general trend towards larger-dimension outputs for models prioritizing the highest relevance. OpenAI’s v3-large was early to the over-2k output size, but NVIDIA’s llama model and voyage-3-large have also moved up to 2048 dimensions. Not coincidentally, these are the three models delivering the most accurate results. And yet, Voyage-3-lite delivers results very nearly as good as NVIDIA llama and OpenAI v3-large in only 512 output dimensions.&lt;/li&gt;
&lt;li&gt;Sitting a notch below voyage-3-large, OpenAI’s v3-large and NVIDIA’s llama-v1 are quite good.&lt;/li&gt;
&lt;li&gt;Stella is also in this second tier, which represents incredibly impressive work from its author, Dun Zhang. After dropping the Stella model like a bomb on HuggingFace, &lt;a href="https://arxiv.org/abs/2412.19048" rel="noopener noreferrer"&gt;Zhang released a whitepaper in late December&lt;/a&gt; giving a few more details. However, the 4x larger stella-1.5b is not significantly more accurate than stella-400m.&lt;/li&gt;
&lt;li&gt;Gemini 004 is in a class by itself with modest performance but the low price of Free. This comes with a reasonable rate limit of 1500 RPM; the only downside is that there’s no way to pay for more throughput.&lt;/li&gt;
&lt;li&gt;Jina v3 and Cohere v3 are at the bottom and are strictly outcompeted: &lt;a href="https://foojay.io/today/indexing-all-of-wikipedia-on-a-laptop/" rel="noopener noreferrer"&gt;as much as I love Cohere v3&lt;/a&gt;, you can use other models with better performance, for less money.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Voyage continues to kill it with their recent releases; if you want the maximum possible relevance, there is a wide gap between voyage-3-large and the group of models that collectively take second place. Voyage-3-lite is also in a strong position with respect to cost:performance, coming very close to openai-v3-large performance for about 1/5 of the price, and with a much smaller output size, meaning searches will be proportionally faster.&lt;/p&gt;

&lt;p&gt;On the open source side, Stella is an excellent option out-of-the-box, and small enough to easily fine-tune for even better performance. It’s crazy to me that this came from a single developer.&lt;/p&gt;

&lt;p&gt;It’s an exciting time to build with AI!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>embeddingmodels</category>
      <category>ai</category>
      <category>vectorembeddings</category>
    </item>
    <item>
      <title>ColBERT Live! Makes Your Vector Database Smarter</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Tue, 01 Oct 2024 14:53:41 +0000</pubDate>
      <link>https://dev.to/datastax/colbert-live-makes-your-vector-database-smarter-2knm</link>
      <guid>https://dev.to/datastax/colbert-live-makes-your-vector-database-smarter-2knm</guid>
      <description>&lt;p&gt;ColBERT is a vector search algorithm that combines the signal from multiple vectors per passage to improve search relevance compared to single-vector retrieval. In particular, ColBERT largely solves the problems with out-of-domain search terms. &lt;a href="https://thenewstack.io/overcoming-the-limits-of-rag-with-colbert/" rel="noopener noreferrer"&gt;My introduction to ColBERT&lt;/a&gt; gives a simple, pure Python implementation of ColBERT search.&lt;/p&gt;

&lt;p&gt;But for production usage, the only option until now has been the &lt;a href="https://github.com/stanford-futuredata/colbert" rel="noopener noreferrer"&gt;Stanford ColBERT&lt;/a&gt; library and the &lt;a href="https://github.com/AnswerDotAI/RAGatouille/" rel="noopener noreferrer"&gt;Ragatouille&lt;/a&gt; wrapper. These are high-performance libraries, but they only support use cases that can fit in a two-stage pipeline of (1) ingest all your data and then (2) search it. Updating indexed data is not supported, and integrating with other data your application cares about (such as ACLs) or even other parts of the indexed data (creation date, author, etc.) is firmly in roll-your-own territory.&lt;/p&gt;

&lt;p&gt;This post introduces the &lt;a href="https://github.com/jbellis/colbert-live" rel="noopener noreferrer"&gt;ColBERT Live!&lt;/a&gt; library for production-quality ColBERT search with an off-the-shelf vector database. This addresses both of the limitations of Stanford ColBERT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ColBERT Live! enables you to combine ColBERT search with other predicates or filters natively at the database level instead of trying to sync and combine multiple index sources in application code.&lt;/li&gt;
&lt;li&gt;ColBERT Live! supports realtime inserts and updates without rebuilding the index from scratch (assuming the underlying database supports them, of course).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;ColBERT breaks both queries and documents into vector-per-token, then computes the maximum similarity (maxsim) between each query vector and each document vector, and sums them together to arrive at the overall score. This gives more accurate results than hoping that your single-vector embedding model was able to capture the full semantics of your document in one shot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkc1bgmngmqzv4mlsa7r9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkc1bgmngmqzv4mlsa7r9.png" alt="Image description" width="800" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image from Benjamin Clavié’s excellent article on ColBERT token pooling&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are thus three steps to a ColBERT search:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify a subset of documents as candidates for maxsim computation&lt;/li&gt;
&lt;li&gt;Fetch all the embeddings for each candidate&lt;/li&gt;
&lt;li&gt;Compute the maxsim for each candidate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since maxsim score computation is O(N x M) where N is the number of query vectors and M is the number of document vectors, ColBERT search is only feasible if we can restrict the candidate set somehow. In fact, each of these three steps can be optimized better than the brute force approach.&lt;/p&gt;
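The maxsim scoring described above can be sketched in a few lines; this is a minimal illustration with random unit vectors, not the library's implementation:

```python
import torch

def maxsim_score(Q, D):
    # Q: (num_query_tokens, dim), D: (num_doc_tokens, dim), both L2-normalized.
    # Cosine similarity of every query token against every document token...
    sim = Q @ D.T                       # (N, M) similarity matrix
    # ...then take the best-matching document token per query token, and sum.
    return sim.max(dim=1).values.sum().item()

Q = torch.nn.functional.normalize(torch.randn(32, 96), dim=1)
D = torch.nn.functional.normalize(torch.randn(180, 96), dim=1)
score = maxsim_score(Q, D)
```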
&lt;h2&gt;
  
  
  A stronger embeddings model
&lt;/h2&gt;

&lt;p&gt;But first, let’s talk about how we generate embeddings in the first place. ColBERT Live! uses the Answer AI colbert-small-v1 model by default. Do read author &lt;a href="https://www.answer.ai/posts/2024-08-13-small-but-mighty-colbert.html" rel="noopener noreferrer"&gt;Benjamin Clavié’s detailed explanation&lt;/a&gt;, but the summary is: this model is smaller (faster to compute embeddings), more aggressive at dimension reduction (faster searches), and better-trained (more relevant results) than the original.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpcm30ht5dm8yx2n5ts6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpcm30ht5dm8yx2n5ts6.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because all three stages of ColBERT search are O(N) in the embedding size, and colbert-small-v1 embeddings are 25% smaller than colbert-v2’s, we would expect to see about a 25% improvement in search times after switching from the colbert-v2 model to answerai-colbert-small-v1, and that is in fact what we observe.&lt;/p&gt;
&lt;h2&gt;
  
  
  Better candidate generation
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://arxiv.org/abs/2004.12832" rel="noopener noreferrer"&gt;original ColBERT research paper&lt;/a&gt;, and in my first article linked above, candidates are identified by performing a standard single-vector search for each of the query embeddings with top &lt;code&gt;k' = k/2&lt;/code&gt;, then taking the union of those results.&lt;/p&gt;
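As a sketch of that original union strategy (the ann_search callable and the toy index here are stand-ins for illustration, not part of any real library):

```python
# Run one ANN search per query token embedding and take the union of the
# returned document ids as the maxsim candidate set.
def candidates_by_union(query_vectors, ann_search, k):
    k_prime = max(1, k // 2)           # k' = k/2, per the original paper
    candidates = set()
    for qv in query_vectors:
        for doc_id, _score in ann_search(qv, k_prime):
            candidates.add(doc_id)
    return candidates

# toy index: "nearest" doc ids are precomputed per query token
toy_results = {0: [("a", 0.9), ("b", 0.8)], 1: [("b", 0.7), ("c", 0.6)]}
ann = lambda qv, limit: toy_results[qv][:limit]
cands = candidates_by_union([0, 1], ann, 4)
```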

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2112.01488" rel="noopener noreferrer"&gt;ColBERTv2 paper&lt;/a&gt; adds a custom index and a new candidate generation algorithm. The specifics are tightly coupled with the new index implementation, but the basic idea is simple enough:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch and score the document embeddings Dj that are nearest to each query vector Qi&lt;/li&gt;
&lt;li&gt;Group Dj by document, keeping only the best score from each document for each Qi&lt;/li&gt;
&lt;li&gt;Sum the scores for each document across all Qi&lt;/li&gt;
&lt;li&gt;Retrieve all embeddings from the top M documents for full maxsim scoring&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is how this is implemented in colbert-live:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search(self,
           query: str,
           n_ann_docs,
           n_maxsim_candidates
           ) -&amp;gt; List[Tuple[Any, float]]:
   """
   Q = self.encode_query(query)
   query_encodings = Q[0]

   # compute the max score for each term for each doc
   chunks_per_query = {}
   for n, rows in enumerate(self.db.query_ann(query_encodings, n_ann_docs)):
       for chunk_id, similarity in rows:
           key = (chunk_id, n)
           chunks_per_query[key] = max(chunks_per_query.get(key, -1), similarity)
   if not chunks_per_query:
       return []  # empty database

   # sum the partial scores and identify the top candidates
   chunks = {}
   for (chunk_id, qv), similarity in chunks_per_query.items():
       chunks[chunk_id] = chunks.get(chunk_id, 0) + similarity
   candidates = sorted(chunks, key=chunks.get, reverse=True)[:n_maxsim_candidates]

   # Load document encodings
   D_packed, D_lengths = self._load_data_and_construct_tensors(candidates)
   # Calculate full ColBERT scores
   scores = colbert_score_packed(Q, D_packed, D_lengths, config=self._cf)

   # Map the scores back to chunk IDs and sort
   results = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

   # Convert tensor scores to Python floats and return top k results
   return [(chunk_id, score.item()) for chunk_id, score in results[:k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This accelerates the search by a factor of about 3 for equal relevance.&lt;/p&gt;

&lt;p&gt;A practical note: since relevance will suffer if the correct candidates aren’t found during the ANN query stage, it’s important to use a very high ANN top k (&lt;code&gt;n_ann_docs&lt;/code&gt;). Fortunately, the performance impact of doing this is low compared to increasing the maxsim candidate pool size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document embedding pooling
&lt;/h2&gt;

&lt;p&gt;ColBERT Live! supports document embedding pooling, which aims to eliminate low-signal vectors from the document embeddings. Author &lt;a href="https://www.answer.ai/posts/colbert-pooling.html" rel="noopener noreferrer"&gt;Benjamin Clavié has a full writeup here&lt;/a&gt; but the short version is that document embedding pooling reduces the number of vectors per document by approximately the pooling factor, i.e. a corpus indexed with pooling factor of 2 would have ½ as many vectors as an unpooled index, and with pooling factor of 3 ⅓ as many.&lt;/p&gt;

&lt;p&gt;This means that less work has to be done in stages (2) and (3) of the search. In ColBERT Live!, a pool factor of 2 reduces search times by about 1/3. We’ll look at the tradeoff vs relevance below. (The details for ColBERT Live! are not quite the same as for Stanford ColBERT in Benjamin’s article.)&lt;/p&gt;
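A rough sketch of the idea, clustering a document's token embeddings down to 1/pool_factor as many vectors (modeled loosely on Clavié's token-pooling writeup; ColBERT Live!'s internals differ in the details):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pool_doc_embeddings(doc_embeddings, pool_factor=2):
    # cluster the document's token vectors into n/pool_factor groups
    n = doc_embeddings.shape[0]
    n_clusters = max(1, n // pool_factor)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(doc_embeddings)
    # mean-pool each cluster into a single vector
    pooled = np.stack([doc_embeddings[labels == c].mean(axis=0)
                       for c in range(n_clusters)])
    # re-normalize so cosine-based maxsim still behaves
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

emb = np.random.default_rng(0).standard_normal((40, 96)).astype(np.float32)
pooled = pool_doc_embeddings(emb, pool_factor=2)
```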

&lt;h2&gt;
  
  
  Query embedding pooling
&lt;/h2&gt;

&lt;p&gt;Inspired by Benjamin’s work on document pooling, I implemented a similar approach for query embeddings. Compared to pooling document embeddings, I found that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clustering by cosine distance threshold works much better than clustering with a target number of clusters and euclidean distance&lt;/li&gt;
&lt;li&gt;Re-normalizing the pooled vectors is critical to preserving relevance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _pool_query_embeddings(query_embeddings: torch.Tensor, max_distance: float, use_gpu: bool) -&amp;gt; torch.Tensor:
   # Convert embeddings to numpy for clustering
   embeddings_np = query_embeddings.cpu().numpy()
   # Cluster
   clustering = AgglomerativeClustering(
       metric='cosine',
       linkage='average',
       distance_threshold=max_distance,
       n_clusters=None
   )
   labels = clustering.fit_predict(embeddings_np)

   # Pool the embeddings based on cluster assignments
   pooled_embeddings = []
   for label in set(labels):
       cluster_indices = np.where(labels == label)[0]
       cluster_embeddings = query_embeddings[cluster_indices]
       if len(cluster_embeddings) &amp;gt; 1:
           # average the embeddings in the cluster
           pooled_embedding = cluster_embeddings.mean(dim=0)
           if use_gpu:
               pooled_embedding = pooled_embedding.cuda()
           # re-normalize the pooled embedding
           pooled_embedding = pooled_embedding / torch.norm(pooled_embedding, p=2)
           pooled_embeddings.append(pooled_embedding)
       else:
           # only one embedding in the cluster, no need to do extra computation
           pooled_embeddings.append(cluster_embeddings[0])

   return torch.stack(pooled_embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query embedding pooling reduces the number of query vectors and hence the work done in stages (1) and (3) of the search. Using a distance of 0.03 with the Answer AI model reduces search time by about 10%, while improving relevance on most BEIR datasets.&lt;/p&gt;

&lt;p&gt;Combining document and query embedding pooling gives a compounded speedup, while improving the relevance problems that we see when using document embedding pooling alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pooling effect on search relevance and speed
&lt;/h2&gt;

&lt;p&gt;To visualize this, here’s a look at NDCG@10 scores for some of the BEIR datasets, normalized to the baseline score without pooling (at n_ann_docs=240, n_maxsim_candidates=20). Query embedding pool distance is 0.03 (the default in ColBERT Live!, based on the histogram of distances between ColBERT embedding vectors), and the document pooling factor is 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qnazt51jcc995ohwzql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qnazt51jcc995ohwzql.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Query embedding pooling is virtually a free lunch, improving relevance while also making searches faster. (The exception is arguana, which tests duplicate detection. Neither ColBERT nor ColBERT Live! optimizes for this use case.)&lt;/p&gt;

&lt;p&gt;Document embedding pooling is more of a mixed bag, offering a significant speedup but also a significant hit to accuracy, although interestingly, adding query embedding pooling on top sometimes helps more than proportionally.&lt;/p&gt;

&lt;p&gt;To see if the trade is worth it, let’s look at more data points. All of these use query embedding pooling distance of 0.03, but vary the other search parameters to see where the optimal tradeoff between speed and accuracy is. The X-axis is NDCG@10 score, and the Y axis is queries per second.&lt;/p&gt;

&lt;p&gt;Points that are strictly worse in both QPS and NDCG for a given series are not shown. The point labels are &lt;code&gt;doc pool factor, n_ann_docs, n_maxsim_candidates: ndcg@10&lt;/code&gt;, and the lines are color-coded by dataset. So the farthest point on the left of &lt;code&gt;2, 120, 20: 0.16&lt;/code&gt; in red means that with &lt;code&gt;doc_pool_factor=2, n_ann_docs=120, n_maxsim_candidates=20&lt;/code&gt;, we scored 0.16 on the scidocs dataset with about 9 QPS.&lt;/p&gt;

&lt;p&gt;(There is only a single point shown for quora because all the other settings scored worse on both QPS and NDCG.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti10kxydnwkbak5t6uo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti10kxydnwkbak5t6uo6.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My takeaways are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document embedding pooling factor of 2 with query embedding pooling distance of 0.03 always achieves a reasonable place on speed:relevance, so that is what ColBERT Live! defaults to.&lt;/li&gt;
&lt;li&gt;You can increase relevance from that starting point but the cost is significant. Usually, you have to increase n_maxsim_candidates to do so.&lt;/li&gt;
&lt;li&gt;If you do need the extra relevance, increasing n_maxsim_candidates is a better way to do that than eliminating document embedding pooling.&lt;/li&gt;
&lt;li&gt;However, increasing the document embedding pooling factor beyond 2 is often a bad trade.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Using ColBERT Live!
&lt;/h2&gt;

&lt;p&gt;If you look at the code sample under “better candidate generation”, you will see that there are two methods that must be implemented by the user:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;db.query_ann&lt;/code&gt;, to perform a search for a single query vector&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db.query_chunks&lt;/code&gt;, to load all the embeddings associated with a given document (called by &lt;code&gt;_load_data_and_construct_tensors&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In fact these are the only two methods that need to be implemented. You can do this at the lowest level by implementing the &lt;code&gt;DB&lt;/code&gt; abstract class, or you can subclass a vendor-specific class like &lt;code&gt;AstraDB&lt;/code&gt; that gives you convenient tools for schema management and parallel querying.&lt;/p&gt;
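To illustrate the shape of that contract, here is an in-memory stand-in; the real abstract class in colbert-live may differ in exact signatures (the brute-force scoring and the limit parameter are my own simplifications):

```python
import torch

class InMemoryDB:
    """Toy stand-in for the two methods the library requires you to supply."""
    def __init__(self):
        self.chunks = {}   # chunk_id: tensor of (num_tokens, dim) embeddings

    def add_chunk(self, chunk_id, embeddings):
        self.chunks[chunk_id] = embeddings

    def query_ann(self, query_vector, limit):
        # brute-force "ANN": best cosine similarity per chunk, top `limit`
        scored = []
        for chunk_id, emb in self.chunks.items():
            sims = emb @ query_vector
            scored.append((chunk_id, sims.max().item()))
        scored.sort(key=lambda t: t[1], reverse=True)
        return scored[:limit]

    def query_chunks(self, chunk_ids):
        # load all the embeddings associated with the given chunks
        return [self.chunks[cid] for cid in chunk_ids]

db = InMemoryDB()
db.add_chunk("doc1", torch.nn.functional.normalize(torch.randn(5, 8), dim=1))
top = db.query_ann(torch.nn.functional.normalize(torch.randn(8), dim=0), limit=1)
```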

&lt;p&gt;The ColBERT Live! repo includes a full example of creating a simple command-line tool to add and search documents in an Astra database.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use ColBERT Live!
&lt;/h2&gt;

&lt;p&gt;ColBERT Live! incorporates the latest techniques from Stanford ColBERT and Answer.AI and introduces new ones to reduce the overhead of combining multiple vectors while maintaining high query relevance.&lt;/p&gt;

&lt;p&gt;Consider using ColBERT Live! if you need robust, production-ready semantic search that offers state-of-the-art relevance on out-of-domain search terms and integrates with your existing vector database.&lt;/p&gt;

&lt;p&gt;Get it from pypi (&lt;code&gt;pip install colbert-live&lt;/code&gt;) or &lt;a href="https://github.com/jbellis/colbert-live" rel="noopener noreferrer"&gt;check it out on GitHub&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>softwareengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Indexing All of Wikipedia on a Laptop</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Tue, 18 Jun 2024 20:07:25 +0000</pubDate>
      <link>https://dev.to/datastax/indexing-all-of-wikipedia-on-a-laptop-5boi</link>
      <guid>https://dev.to/datastax/indexing-all-of-wikipedia-on-a-laptop-5boi</guid>
      <description>&lt;p&gt;In November, &lt;a href="https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3" rel="noopener noreferrer"&gt;Cohere released a dataset containing all of Wikipedia&lt;/a&gt;, chunked and embedded to vectors with their &lt;a href="https://cohere.com/blog/introducing-embed-v3" rel="noopener noreferrer"&gt;multilingual-v3 model&lt;/a&gt;. Computing this many embeddings yourself would cost in the neighborhood of $5000, so the public release of this dataset makes creating a &lt;a href="https://www.datastax.com/guides/what-is-vector-search?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=jvector&amp;amp;utm_term=all-plays&amp;amp;utm_content=loading-wikipedia" rel="noopener noreferrer"&gt;semantic, vector-based index&lt;/a&gt; of Wikipedia practical for an individual for the first time.&lt;/p&gt;

&lt;p&gt;Here’s what we’re building: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hr4f1vx52r9gy3ohh2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hr4f1vx52r9gy3ohh2b.png" alt="Image description" width="800" height="904"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can try &lt;a href="https://jvectordemo.com:8443/" rel="noopener noreferrer"&gt;searching the completed index on a public demo instance here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is hard
&lt;/h2&gt;

&lt;p&gt;Sure, the dataset is big (180GB for the English corpus), but that’s not the obstacle per se. We’ve been able to build full-text indexes on larger datasets for a long time.&lt;/p&gt;

&lt;p&gt;The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction. Larger datasets could be split into &lt;a href="https://stackoverflow.com/questions/2703432/what-are-segments-in-lucene" rel="noopener noreferrer"&gt;segments&lt;/a&gt;, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall. (In their latest release, &lt;a href="https://www.elastic.co/search-labs/blog/elasticsearch-lucene-vector-database-gains" rel="noopener noreferrer"&gt;Lucene attempts to mitigate this by processing segments in parallel with multiple threads&lt;/a&gt;, but obviously (1) this only gives you a constant factor of improvement before you run out of CPU cores and (2) this does not improve throughput.)&lt;/p&gt;

&lt;p&gt;Specifically, if you’re indexing 1536-dimension vectors (the size of ada002 or openai-v3-small), then you can fit about 5M vectors and their associated edge lists in a 32GB index construction RAM budget.&lt;/p&gt;
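That figure checks out with quick arithmetic (float32 storage assumed; the edge lists are extra on top of this):

```python
# Back-of-the-envelope check: raw full-resolution vectors alone at
# 1536 dimensions, before any edge-list overhead.
dims = 1536
bytes_per_vector = dims * 4                  # float32
vectors = 5_000_000
vector_bytes = vectors * bytes_per_vector    # raw vectors only
vector_gb = vector_bytes / 2**30             # roughly 28.6 GiB of the 32GB budget
```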

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9af3xq1y4c49ykfsxgv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9af3xq1y4c49ykfsxgv5.png" alt="Image description" width="800" height="493"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/jbellis/jvector/" rel="noopener noreferrer"&gt;JVector&lt;/a&gt;, the library that powers &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=jvector&amp;amp;utm_term=all-plays&amp;amp;utm_content=loading-wikipedia" rel="noopener noreferrer"&gt;DataStax Astra DB vector search&lt;/a&gt;, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS. It will not work on Windows because ChronicleMap, which we are going to use for the non-vector data, is limited to a 4GB size there. (If you are interested enough, you could shard the Map by vector id to keep each shard under 4GB and still have O(1) lookup times.)&lt;/li&gt;
&lt;li&gt;About 180GB of free space for the dataset, and 90GB for the completed index.&lt;/li&gt;
&lt;li&gt;Enough RAM to run a JVM with 36GB of heap space during construction (~28GB for the index, 8GB for GC headroom).&lt;/li&gt;
&lt;li&gt;Disable swap before building the index. Linux will aggressively cache the index being constructed, to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.&lt;/li&gt;
&lt;/ul&gt;
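&lt;p&gt;The sharding workaround mentioned above is simple enough to sketch. This is a toy illustration with plain HashMaps standing in for the per-shard ChronicleMaps; the point is that routing by vector id keeps every shard bounded while lookups stay O(1):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Toy sharded map: split one big id->value map into a fixed number of
// shards by vector id, so no single shard grows past a size limit
// (4GB for ChronicleMap on Windows), while get/put remain O(1).
public class ShardedMap<V> {
    private final Map<Integer, V>[] shards;

    @SuppressWarnings("unchecked")
    public ShardedMap(int numShards) {
        shards = new Map[numShards];
        for (int i = 0; i < numShards; i++) shards[i] = new HashMap<>();
    }

    private Map<Integer, V> shardFor(int id) {
        // floorMod keeps the shard index non-negative for any int id
        return shards[Math.floorMod(id, shards.length)];
    }

    public void put(int id, V value) { shardFor(id).put(id, value); }
    public V get(int id)             { return shardFor(id).get(id); }
}
```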

&lt;h2&gt;
  
  
  Building and searching the index
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Check out the project:
&lt;code&gt;$ git clone https://github.com/jbellis/coherepedia-jvector
$ cd coherepedia-jvector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Edit &lt;em&gt;config.properties&lt;/em&gt; to set the locations for the dataset and the index. &lt;/li&gt;
&lt;li&gt;Run &lt;em&gt;pip install datasets&lt;/em&gt;. (Setting up a &lt;a href="https://docs.python.org/3/library/venv.html" rel="noopener noreferrer"&gt;venv&lt;/a&gt; or conda environment first is recommended but not strictly necessary.)&lt;/li&gt;
&lt;li&gt;Run &lt;em&gt;python download.py&lt;/em&gt;. This downloads the 180 GB dataset to the location you configured. For me that took about half an hour.&lt;/li&gt;
&lt;li&gt;Run &lt;em&gt;./mvnw compile exec:exec@buildindex&lt;/em&gt;. This took about 5 and a half hours on my machine (with an i9-12900 CPU).&lt;/li&gt;
&lt;li&gt;Run &lt;em&gt;./mvnw compile exec:exec@serve&lt;/em&gt; and open a browser to &lt;a href="http://localhost:4567" rel="noopener noreferrer"&gt;http://localhost:4567&lt;/a&gt;. Search away!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;We’re using JVector for the vector index and &lt;a href="https://github.com/OpenHFT/Chronicle-Map" rel="noopener noreferrer"&gt;Chronicle Map&lt;/a&gt; for the article data. There are &lt;a href="https://github.com/OpenHFT/Chronicle-Map/issues/533" rel="noopener noreferrer"&gt;several things&lt;/a&gt; I don’t love about Chronicle Map, but nothing else touches it for simple disk-based key/value performance.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/jbellis/coherepedia-jvector/blob/master/src/main/java/io/github/jbellis/BuildIndex.java" rel="noopener noreferrer"&gt;full source of the index construction class is here&lt;/a&gt;. I’ll explain it next in pieces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compression parameters
&lt;/h2&gt;

&lt;p&gt;JVector is based on the &lt;a href="https://www.microsoft.com/en-us/research/publication/diskann-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node/" rel="noopener noreferrer"&gt;DiskANN&lt;/a&gt; vector index design, which performs an initial search using vectors lossily compressed in memory with &lt;a href="https://towardsdatascience.com/similarity-search-product-quantization-b2a1a6397701" rel="noopener noreferrer"&gt;product quantization&lt;/a&gt; (PQ), then reranks the results using high-resolution vectors from disk. However, while DiskANN stores full, uncompressed vectors for reranking, JVector improves on that using &lt;a href="https://arxiv.org/abs/2402.02044" rel="noopener noreferrer"&gt;Locally-Adaptive Vector Quantization&lt;/a&gt; (LVQ) compression.&lt;/p&gt;

&lt;p&gt;To set this up, we’ll first load some vectors into a RandomAccessVectorValues (RAVV) instance. RAVV is JVector’s interface for a vector container; it can be List- or Map-based, in-memory or on-disk. In this case we’ll use a simple List-backed RAVV. We’ll compute the parameters for both compressions (k-means clustering for PQ, the global mean for LVQ) from a single shard of the dataset; at about 110k rows, this is enough data for a statistically valid sample.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3q1zfstj8xuze9ul1ei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3q1zfstj8xuze9ul1ei.png" alt="Image description" width="800" height="78"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we compute the PQ compression codebook. We compress the vectors by a factor of 64 because the Cohere v3 embeddings can be PQ-compressed that much without losing accuracy once the results are reranked; &lt;a href="https://thenewstack.io/why-vector-size-matters/" rel="noopener noreferrer"&gt;Binary Quantization only gives us 32x compression and is less accurate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkjshvia8p6u4odppkyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkjshvia8p6u4odppkyh.png" alt="Image description" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we need to set up LVQ. LVQ gives us 4x compression while losing no measurable accuracy over the full uncompressed vectors, resulting in both a smaller footprint on disk and faster searches. (I thank the vector search team at Intel Research for pointing this out to us.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j25v035gdoxgjgbpqny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j25v035gdoxgjgbpqny.png" alt="Image description" width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GraphIndexBuilder
&lt;/h2&gt;

&lt;p&gt;Next, we need to instantiate and configure our GraphIndexBuilder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8okv653jepfo9k709fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8okv653jepfo9k709fo.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This instantiates a JVector GraphIndexBuilder, connects it to an OnDiskGraphIndexWriter, and (via the BuildScoreProvider) tells it to use the PQ-compressed vectors list during construction; that list starts empty and grows as we add vectors to the index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chronicle Map and RowData
&lt;/h2&gt;

&lt;p&gt;We’ll store article contents in RowData records. This content is what has been encoded as the corresponding vector in the dataset, and is what we want to return to the user in our search results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcewwextn64qluehx9js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcewwextn64qluehx9js.png" alt="Image description" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To turn the vector index’s search results (a list of integer vector ids) into RowData, we store the RowData in a Map keyed by vector id. That’s a lot of data, so we use ChronicleMap to keep it on disk with a minimal in-memory footprint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnl2h0d3jgjc82ai4n2l8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnl2h0d3jgjc82ai4n2l8.png" alt="Image description" width="800" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to tell ChronicleMap how large it’s going to be, in both expected entry count and average entry size. Undersizing either will cause it to crash (my primary complaint about ChronicleMap), so we deliberately use high estimates.&lt;/p&gt;

&lt;p&gt;We do not need to explicitly tell ChronicleMap how to read and write RowData objects; we just have RowData implement Serializable. ChronicleMap supports custom de/serialization code, but it’s perfectly happy with simple out-of-the-box serialization, and since profiling shows that’s not a bottleneck for us, we’ll leave it at that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ingesting the data
&lt;/h2&gt;

&lt;p&gt;We use Java’s parallel Streams to process the shards in parallel. For each row in each shard, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add it to &lt;em&gt;pqVectorsList&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Call &lt;em&gt;writer.writeInline&lt;/em&gt; to add the LVQ-compressed vector to disk&lt;/li&gt;
&lt;li&gt;Call &lt;em&gt;builder.addGraphNode&lt;/em&gt; – the order matters, because addGraphNode consumes the data written in steps (1) and (2)&lt;/li&gt;
&lt;li&gt;Call &lt;em&gt;contentMap.put&lt;/em&gt; with the article chunk data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zz1cz3yqbh0pzzmjmyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zz1cz3yqbh0pzzmjmyk.png" alt="Image description" width="800" height="38"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpf2zkg8hvxbjemv3pc45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpf2zkg8hvxbjemv3pc45.png" alt="Image description" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can look at the &lt;a href="https://github.com/jbellis/coherepedia-jvector/blob/master/src/main/java/io/github/jbellis/BuildIndex.java" rel="noopener noreferrer"&gt;full source&lt;/a&gt; if you’re curious about forEachRow; it’s just standard “pull data out of Arrow” code.&lt;/p&gt;

&lt;p&gt;When the build completes, you should see files like this:&lt;br&gt;
&lt;code&gt;$ ls -lh ~/coherepedia&lt;br&gt;
-rw-rw-r-- 1 jonathan jonathan 48G May 20 15:53 coherepedia.ann&lt;br&gt;
-rw-rw-r-- 1 jonathan jonathan 36G May 20 18:05 coherepedia.map&lt;br&gt;
-rw-rw-r-- 1 jonathan jonathan 2.5G May 20 15:53 coherepedia.pqv&lt;br&gt;
-rw-rw-r-- 1 jonathan jonathan 4.1K May 17 23:04 coherepedia.lvq&lt;br&gt;
-rw-rw-r-- 1 jonathan jonathan 1.1M May 17 23:04 coherepedia.pq&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These are, respectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ANN: the vector index, containing the edge lists and LVQ-compressed vectors for reranking.&lt;/li&gt;
&lt;li&gt;MAP: the map containing article data indexed by vector id.&lt;/li&gt;
&lt;li&gt;PQV: PQ-compressed vectors, which are read into memory and used for the approximate search pass.&lt;/li&gt;
&lt;li&gt;LVQ: the LVQ global mean, used during construction.&lt;/li&gt;
&lt;li&gt;PQ: the PQ codebooks, used during construction.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Loading the index (after construction)
&lt;/h2&gt;

&lt;p&gt;The code for serving queries is found in the &lt;a href="https://github.com/jbellis/coherepedia-jvector/blob/master/src/main/java/io/github/jbellis/WebSearch.java" rel="noopener noreferrer"&gt;WebSearch&lt;/a&gt; class. We’re using Spark (&lt;a href="https://sparkjava.com/" rel="noopener noreferrer"&gt;the web framework&lt;/a&gt;, not the big data engine) to serve a simple search form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsplrjevpdbjgfnv5cbog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsplrjevpdbjgfnv5cbog.png" alt="Image description" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Construction needed a relatively large heap to keep the edge lists in memory. With that complete, we only need enough to keep the PQ-compressed vectors in memory; exec@serve is configured to use a 4GB heap.&lt;/p&gt;

&lt;p&gt;WebSearch (&lt;a href="https://github.com/jbellis/coherepedia-jvector/blob/master/src/main/java/io/github/jbellis/WebSearch.java" rel="noopener noreferrer"&gt;the class behind exec@serve&lt;/a&gt;) first has a static initializer to load the PQ vectors and open the ChronicleMap. We also create a reusable GraphSearcher instance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ybmay0tbggp4tyhwohv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ybmay0tbggp4tyhwohv.png" alt="Image description" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performing a search
&lt;/h2&gt;

&lt;p&gt;Executing a search and turning it into RowData for the user looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsne99d5fttajefgneca2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsne99d5fttajefgneca2.png" alt="Image description" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are four “paragraphs” of code here, containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The call to getVectorEmbedding. This calls Cohere’s API to turn the search query (a String) into a vector embedding.&lt;/li&gt;
&lt;li&gt;Creating approximate and reranking score functions. Approximate scoring is done through our product quantization, and reranking is done with the LVQ vectors in the index. Since the LVQ vectors are encapsulated in the index itself, we never need to explicitly deal with LVQ decoding; the index does it for us.&lt;/li&gt;
&lt;li&gt;The call to &lt;em&gt;searcher.search&lt;/em&gt; that actually does the query&lt;/li&gt;
&lt;li&gt;Retrieving the RowData associated with the top vector neighbors using contentMap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it! We’ve indexed all of Wikipedia with high-performance, parallel code in about 150 lines, and created a simple search server in another 100.&lt;/p&gt;

&lt;p&gt;On my machine, searches (which each run in a single thread) take about 50ms. We would expect it to take over twice as long if this were split across multiple segments. We would also expect it to lose significant accuracy if searches were performed only with compressed vectors without reranking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Indexing the entirety of English Wikipedia on a laptop has become a practical reality thanks to recent advances in the JVector library that will be part of the imminent 3.0 release. (&lt;a href="https://github.com/jbellis/jvector" rel="noopener noreferrer"&gt;Star the repo&lt;/a&gt; and stand by!) This article demonstrates how to do exactly that using JVector in conjunction with Chronicle Map, while also showcasing the use of &lt;a href="https://arxiv.org/abs/2402.02044" rel="noopener noreferrer"&gt;LVQ&lt;/a&gt; to reduce index size while preserving &lt;a href="https://thenewstack.io/why-vector-size-matters/" rel="noopener noreferrer"&gt;accurate reranking&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To take advantage of the power of JVector alongside powerful indexing for non-vector data, all rolled into a document API with support for realtime inserts, updates, and deletes, check out &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=jvector&amp;amp;utm_term=all-plays&amp;amp;utm_content=loading-wikipedia" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Enjoy hacking with JVector and Astra DB!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>data</category>
      <category>jvector</category>
    </item>
    <item>
      <title>Why Vector Compression Matters</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Wed, 24 Apr 2024 23:35:31 +0000</pubDate>
      <link>https://dev.to/datastax/why-vector-compression-matters-64l</link>
      <guid>https://dev.to/datastax/why-vector-compression-matters-64l</guid>
      <description>&lt;p&gt;Vector indexes are the hottest topic in databases because approximate nearest neighbor (ANN) vector search puts the R in RAG (&lt;a href="https://hackernoon.com/how-llms-and-vector-search-have-revolutionized-building-ai-applications?utm_source=dev-to&amp;amp;utm_medium=website&amp;amp;utm_content=inline-mention&amp;amp;utm_campaign=platform" rel="noopener noreferrer"&gt;retrieval-augmented generation&lt;/a&gt;). “Nearest neighbor” for text embedding models is almost always measured with angular distance, for instance, the &lt;a href="https://blog.google/technology/ai/google-palm-2-ai-large-language-model/?utm_source=thenewstack&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vector&amp;amp;utm_term=all-plays&amp;amp;utm_content=vector-size-matters" rel="noopener noreferrer"&gt;cosine between two vectors&lt;/a&gt;. Getting the retrieval accurate and efficient is a critical factor for the entire application; failing to find relevant context — or taking too long to find it — will leave your &lt;a href="https://www.datastax.com/guides/what-is-a-large-language-model?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vector&amp;amp;utm_term=all-plays&amp;amp;utm_content=vector-size-matters" rel="noopener noreferrer"&gt;large language model&lt;/a&gt; (LLM) prone to hallucination and your users frustrated.&lt;/p&gt;

&lt;p&gt;Every general-purpose ANN index is built on a graph structure. This is because graph-based indexes allow for incremental updates, good recall and low-latency queries. (The one exception was pgvector, which started with a partition-based index, but its creators switched to a graph approach as fast as they could because the partitioning approach was far too slow.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnvzlalmf2b82rxj1byb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnvzlalmf2b82rxj1byb.png" alt="Image description" width="800" height="795"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Caption: Visualization of searching for the closest neighbors of the red target vector in a graph index, starting from the purple entry point.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The well-known downside to graph indexes is that they are incredibly memory-hungry, because the entire set of vectors needs to live in memory: you must compare your query vector to the neighbors of each node you encounter as you expand your search through the graph, and that access pattern is very close to uniformly random. Standard database assumptions that 80% of your accesses will go to 20% of your data do not hold, so straightforward caching will not save you from a huge memory footprint.&lt;/p&gt;

&lt;p&gt;For most of 2023, this flew under the radar of most people using these graph indexes simply because most users were not dealing with large enough data sets to make this a serious problem. That is no longer the case; with vectorized data sets like &lt;a href="https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3" rel="noopener noreferrer"&gt;all of Wikipedia&lt;/a&gt; being easily available, it’s clear that vector search in production needs a better solution than throwing larger machines at the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking the memory barrier with DiskANN
&lt;/h2&gt;

&lt;p&gt;Microsoft Research in 2019 proposed an elegant solution to the problem of large vector indexes in “&lt;a href="https://www.microsoft.com/en-us/research/publication/diskann-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node/" rel="noopener noreferrer"&gt;DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node&lt;/a&gt;.” At a high level, the solution has two parts. First, (lossily) compress the vectors using product quantization (PQ). The compressed vectors are retained in memory instead of the full-resolution originals, reducing the memory footprint while also speeding up search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/jbellis/jvector/" rel="noopener noreferrer"&gt;JVector&lt;/a&gt; builds on the ideas in DiskANN to provide state-of-the-art vector search for Java applications. I’ve used the JVector Bench driver to visualize how recall (search accuracy) degrades when searching for the top 100 neighbors in data sets created by different embedding models against a small sample of chunked Wikipedia articles. (The data sets are built using the open source &lt;a href="https://www.datastax.com/blog/vector-search-for-production-a-gpu-powered-knn-ground-truth-dataset-generator?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vector&amp;amp;utm_term=all-plays&amp;amp;utm_content=vector-size-matters" rel="noopener noreferrer"&gt;Neighborhood Watch&lt;/a&gt; tool.)  Perfect accuracy would be a recall of 1.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1e1dv79pwowkme9l97z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1e1dv79pwowkme9l97z.png" alt="Image description" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s clear that recall suffers only a little at 4x and 8x compression, but falls off quickly after that.&lt;/p&gt;

&lt;p&gt;That’s where the second part of DiskANN comes in. To achieve higher compression (which allows fitting larger indexes into memory) while making up for the reduced accuracy, DiskANN overqueries (searches deeper into the graph) and then reranks the results using the full-precision vectors that are retained on disk.&lt;/p&gt;
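&lt;p&gt;The overquery-and-rerank pattern can be sketched in a few lines (my own minimal version, not DiskANN’s code). For simplicity the approximate pass here exhaustively scores every candidate; in a real graph index the approximate scores instead guide the traversal:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Overquery + rerank: rank all candidates by a cheap approximate score,
// keep the top topK * overquery, then re-score only that shortlist with
// the exact (expensive) score and return the exact top topK.
public class Rerank {
    public static List<Integer> search(int nCandidates,
                                       IntToDoubleFunction approxScore,
                                       IntToDoubleFunction exactScore,
                                       int topK, int overquery) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < nCandidates; i++) ids.add(i);

        // Approximate pass: keep the top topK * overquery candidates.
        Comparator<Integer> byApprox =
                Comparator.comparingDouble(approxScore::applyAsDouble);
        ids.sort(byApprox.reversed());
        List<Integer> shortlist =
                new ArrayList<>(ids.subList(0, Math.min(topK * overquery, ids.size())));

        // Rerank pass: exact scores on the shortlist only.
        Comparator<Integer> byExact =
                Comparator.comparingDouble(exactScore::applyAsDouble);
        shortlist.sort(byExact.reversed());
        return shortlist.subList(0, Math.min(topK, shortlist.size()));
    }
}
```

Even when the approximate score collapses many candidates to the same value, a wide enough shortlist lets the exact rerank recover the true top results.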

&lt;p&gt;Here’s how recall looks when we add in overquery of up to 3x (such as retrieving the top 300 results using the PQ in-memory similarity) and then reranking to top 100. To keep the graph simple, we’ll focus on the openai-v3-small dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7aziznsgaame9it6hcrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7aziznsgaame9it6hcrf.png" alt="Image description" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With 3x overquery (fetching the top 300 results for a top-100 query using the compressed vectors, and then reranking with full resolution), we can compress the openai-v3-small vectors up to 64x while maintaining or exceeding the original accuracy.&lt;/p&gt;

&lt;p&gt;PQ + rerank is how JVector takes advantage of the strengths of both fast memory and cheap disk to deliver a hybrid index that offers the best of both worlds. To make this more user-friendly, &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vector&amp;amp;utm_term=all-plays&amp;amp;utm_content=vector-size-matters" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; simplifies this to a single &lt;code&gt;source_model&lt;/code&gt; setting when creating the index – tell Astra DB where your embeddings come from, and it will automatically use the optimal settings.&lt;/p&gt;

&lt;p&gt;(If you want to go deeper on how PQ works, Peggy Chang wrote up &lt;a href="https://towardsdatascience.com/product-quantization-for-similarity-search-2f1f67c5fddd" rel="noopener noreferrer"&gt;the best explanation of PQ that I’ve seen&lt;/a&gt; — or you can always go straight to the &lt;a href="https://github.com/jbellis/jvector/blob/040baecee8041180691407c03ccae3fd757b8268/jvector-base/src/main/java/io/github/jbellis/jvector/pq/ProductQuantization.java#L50" rel="noopener noreferrer"&gt;source&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Binary quantization
&lt;/h2&gt;

&lt;p&gt;Binary quantization (BQ) is an alternative approach to vector compression, where each float32 component is quantized to either 0 (if negative) or 1 (if positive). This is extremely lossy! But it’s still enough to provide useful results for some embedding sources if you overquery appropriately, which makes it potentially attractive because computing BQ similarity is so fast — essentially just the &lt;a href="https://en.wikipedia.org/wiki/Hamming_distance" rel="noopener noreferrer"&gt;Hamming distance&lt;/a&gt;, which can be computed blisteringly quickly using &lt;a href="https://en.wikipedia.org/wiki/SWAR" rel="noopener noreferrer"&gt;SWAR&lt;/a&gt; (&lt;a href="https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Long.java#L1654" rel="noopener noreferrer"&gt;here’s OpenJDK’s implementation of the core method&lt;/a&gt;). Here’s how BQ recall looks with 1x to 4x overquery against the same five data sets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47wx77bgny6dmiy1itly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47wx77bgny6dmiy1itly.png" alt="Image description" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This shows the limitations of BQ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Even with overquery, most embedding sources lose too much accuracy with BQ to make it back up.&lt;/li&gt;
&lt;li&gt;OpenAI-v3-small is one of the models that compresses nicely with BQ, but we can get even more compression with PQ (64x!) without losing accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the only model that Astra compresses with BQ by default is ada-002, and it needs 4x overquery to match uncompressed recall there.&lt;/p&gt;

&lt;p&gt;But BQ comparisons really are fast, to the point that they are almost negligible in the search cost. So wouldn’t it be worth pushing overquery just a bit higher for models that retain almost as much accuracy with BQ, like Gecko (the Google Vertex embedding model)?&lt;/p&gt;
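&lt;p&gt;That speed is easy to see in code. Here is a minimal sketch (mine) of BQ encoding and comparison: each float becomes one sign bit, packed 64 per long, and similarity is just the Hamming distance computed with Long.bitCount, the SWAR popcount mentioned above:&lt;/p&gt;

```java
// Binary quantization: 1 bit per component (positive -> 1), so a
// 1536-dim float32 vector shrinks from 6144 bytes to 192 bytes (32x).
public class BinaryQuant {
    public static long[] quantize(float[] v) {
        long[] bits = new long[(v.length + 63) / 64]; // 64 components per long
        for (int i = 0; i < v.length; i++)
            if (v[i] > 0) bits[i / 64] |= 1L << (i % 64);
        return bits;
    }

    // Hamming distance: XOR then popcount, one SWAR bitCount per 64 dims.
    public static int hammingDistance(long[] a, long[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++)
            d += Long.bitCount(a[i] ^ b[i]);
        return d;
    }
}
```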

&lt;p&gt;The problem is that the more overquery you need to do to make up for the accuracy you lose to compression, the more work there is to do in the reranking phase, and that becomes the dominant factor. Here’s what the numbers look like for Gecko with PQ compressing the same amount as BQ (32x) and achieving nearly the same recall (BQ recall is slightly worse, 0.90 vs 0.92):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3y84egb6kqwry0c45iu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3y84egb6kqwry0c45iu.png" alt="Image description" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For 20,000 searches, BQ evaluated 131 million nodes while PQ touched 86 million. This is expected, because the number of nodes evaluated in an ANN search is almost linear with respect to the result set size requested.&lt;/p&gt;

&lt;p&gt;As a consequence, while the core BQ approximate similarity is almost 4x faster than PQ approximate similarity, the total search time is 50% higher, because it loses more time in reranking and in the rest of the search overhead (loading neighbor lists, tracking the visited set, etc.).&lt;/p&gt;

&lt;p&gt;Over the past year of working in this field, I’ve come to believe that product quantization is the quicksort of vector compression. It’s a simple algorithm and it’s been around for a long time, but it’s nearly impossible to beat it consistently across a wide set of use cases because its combination of speed and accuracy is almost unreasonably good.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about multi-vector ranking?
&lt;/h2&gt;

&lt;p&gt;I’ll conclude by explaining how vector compression relates to &lt;a href="https://github.com/stanford-futuredata/colbert" rel="noopener noreferrer"&gt;ColBERT&lt;/a&gt;, a higher-level technique that Astra DB customers are starting to use successfully.&lt;/p&gt;

&lt;p&gt;Retrieval using a single vector is called dense passage retrieval (DPR), because an entire passage (dozens to hundreds of tokens) is encoded as a single vector. ColBERT instead encodes a vector-per-token, where each vector is influenced by surrounding context. This leads to meaningfully better results; for example, here’s ColBERT running on Astra DB versus DPR using openai-v3-small vectors, evaluated with &lt;a href="https://www.trulens.org/" rel="noopener noreferrer"&gt;TruLens&lt;/a&gt; on the &lt;a href="https://llamahub.ai/l/llama_datasets/Braintrust%20Coda%20Help%20Desk" rel="noopener noreferrer"&gt;Braintrust Coda Help Desk data set&lt;/a&gt;. ColBERT easily beats DPR at &lt;a href="https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad/" rel="noopener noreferrer"&gt;correctness, context relevance, and groundedness&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdl4lp17kmhg4gbgmho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdl4lp17kmhg4gbgmho.png" alt="Image description" width="720" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The challenge with ColBERT is that it generates an order of magnitude more vector data than DPR. While the ColBERT project comes with its own specialized index compression, this suffers from similar weaknesses as other partition-based indexes; in particular, it cannot be constructed incrementally, so it’s only suitable for static, known-in-advance data sets.&lt;/p&gt;
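&lt;p&gt;A rough back-of-the-envelope calculation shows where the extra data comes from (illustrative sizes: 1536 dimensions for openai-v3-small, 128 dimensions per ColBERT token vector, and an assumed 100-token chunk):&lt;/p&gt;

```python
# DPR: one 1536-dim float32 vector per chunk (e.g. openai-v3-small)
dpr_bytes = 1536 * 4

# ColBERT: one 128-dim float32 vector per token; assume a 100-token chunk
colbert_bytes = 100 * 128 * 4

ratio = colbert_bytes / dpr_bytes  # roughly 8x more raw vector data per chunk
```

&lt;p&gt;Longer chunks or higher-dimensional token vectors push that ratio higher still, which is why compression matters so much more for ColBERT-style indexes.&lt;/p&gt;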

&lt;p&gt;Fortunately, it’s &lt;a href="https://thenewstack.io/overcoming-the-limits-of-rag-with-colbert/" rel="noopener noreferrer"&gt;straightforward to implement ColBERT retrieval and ranking on Astra DB&lt;/a&gt;. Here’s how compression vs. recall looks with the BERT vectors generated by ColBERT:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvyhewrjbavvwx5tt84b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvyhewrjbavvwx5tt84b.png" alt="Image description" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sweet spot for these vectors is PQ with 16x compression and 2x overquery; both 32x PQ and BQ lose too much accuracy.&lt;/p&gt;
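&lt;p&gt;Overquery is easy to sketch in code (a minimal illustration with a hypothetical &lt;code&gt;approx_search&lt;/code&gt; helper, not Astra DB’s actual implementation): ask the compressed index for a multiple of the candidates you want, then rerank that pool with exact similarities against the full-precision vectors.&lt;/p&gt;

```python
import numpy as np

def search_with_overquery(query, approx_search, full_vectors, k=10, overquery=2):
    """approx_search(query, n) returns n candidate ids from the compressed index;
    we fetch overquery*k of them, then rerank with exact dot products."""
    candidate_ids = approx_search(query, k * overquery)
    exact = [(float(full_vectors[i] @ query), i) for i in candidate_ids]
    exact.sort(reverse=True)  # highest exact similarity first
    return [i for _, i in exact[:k]]
```

&lt;p&gt;The approximate phase does the cheap pruning; the exact phase repairs most of the accuracy the compression gave up.&lt;/p&gt;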

&lt;p&gt;Product quantization enables Astra DB to serve large ColBERT indexes with accurate and fast results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond simple reranking
&lt;/h2&gt;

&lt;p&gt;Supporting larger-than-memory indexes for Astra DB’s multitenant cloud database was table stakes for JVector. More recently, the JVector team has been working on validating and implementing improvements that go beyond basic DiskANN-style compression + reranking. Some of these include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anisotropic quantization from “&lt;a href="https://arxiv.org/abs/1908.10396" rel="noopener noreferrer"&gt;Accelerating Large-Scale Inference with Anisotropic Vector Quantization&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;Fused graphs that accelerate PQ computation as described in “&lt;a href="https://arxiv.org/abs/1812.09162" rel="noopener noreferrer"&gt;Quicker ADC: Unlocking the hidden potential of Product Quantization with SIMD&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;Locally-adaptive quantization from “&lt;a href="https://arxiv.org/abs/2304.04759" rel="noopener noreferrer"&gt;Similarity search in the blink of an eye with compressed indices&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;Larger-than-memory index construction using compressed vectors (first implemented in JVector, to my knowledge)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JVector currently powers vector search for Astra DB, Apache Cassandra and Upstash’s vector database, with more on the way. Astra DB constantly and invisibly incorporates the latest JVector improvements; &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-tp&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vector&amp;amp;utm_term=all-plays&amp;amp;utm_content=vector-size-matters" rel="noopener noreferrer"&gt;try it out today&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>How ColBERT Helps Developers Overcome the Limits of Retrieval-Augmented Generation</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Mon, 25 Mar 2024 16:21:49 +0000</pubDate>
      <link>https://dev.to/datastax/how-colbert-helps-developers-overcome-the-limits-of-retrieval-augmented-generation-1gkk</link>
      <guid>https://dev.to/datastax/how-colbert-helps-developers-overcome-the-limits-of-retrieval-augmented-generation-1gkk</guid>
      <description>&lt;p&gt;Retrieval-augmented generation (RAG) is by now a &lt;a href="https://hackernoon.com/how-llms-and-vector-search-have-revolutionized-building-ai-applications" rel="noopener noreferrer"&gt;standard part&lt;/a&gt; of generative artificial intelligence (AI) applications. Supplementing your application prompt with relevant context retrieved from a vector database can dramatically increase accuracy and reduce hallucinations. This means that increasing relevance in vector search results has a direct correlation to the quality of your RAG application.&lt;/p&gt;

&lt;p&gt;There are two reasons RAG remains popular and increasingly relevant even as &lt;a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/" rel="noopener noreferrer"&gt;large language models (LLMs) increase their context window&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;LLM response time and price both increase linearly with context length.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LLMs still struggle with both &lt;a href="https://vectorize.io/2024/02/16/rag-is-dead-long-live-rag/" rel="noopener noreferrer"&gt;retrieval&lt;/a&gt; and &lt;a href="https://twitter.com/mosh_levy/status/1762027624434401314" rel="noopener noreferrer"&gt;reasoning&lt;/a&gt; across massive contexts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
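&lt;p&gt;The first point is simple arithmetic (illustrative prices only; &lt;code&gt;prompt_cost&lt;/code&gt; is a hypothetical helper, not any vendor’s actual pricing):&lt;/p&gt;

```python
# hypothetical per-token pricing: cost grows linearly with context length
price_per_1k_tokens = 0.01  # dollars, illustrative only

def prompt_cost(context_tokens, question_tokens=100):
    return (context_tokens + question_tokens) / 1000 * price_per_1k_tokens

small = prompt_cost(4_000)    # RAG: a few retrieved chunks
large = prompt_cost(400_000)  # stuffing a huge corpus into the context
# retrieving only what's relevant is roughly 100x cheaper per request
```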

&lt;p&gt;But &lt;a href="https://www.datastax.com/guides/what-is-retrieval-augmented-generation?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=raG&amp;amp;utm_term=all-plays&amp;amp;utm_content=colbert" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; isn’t a magic wand. In particular, the most common design, dense passage retrieval (DPR), represents both queries and passages as a single embedding vector and uses straightforward &lt;a href="https://www.datastax.com/guides/real-world-applications-of-cosine-similarity?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=RAG&amp;amp;utm_term=all-plays&amp;amp;utm_content=colbert" rel="noopener noreferrer"&gt;cosine similarity&lt;/a&gt; to score relevance. This means DPR relies heavily on the embeddings model having the breadth of training to recognize all the relevant search terms.&lt;/p&gt;

&lt;p&gt;Unfortunately, off-the-shelf models struggle with unusual terms, including names, that are not commonly in their training data. DPR also tends to be hypersensitive to chunking strategy, which can cause a relevant passage to be missed if it’s surrounded by a lot of irrelevant information. All of this creates a burden on the application developer to “get it right the first time,” because a mistake usually results in the need to rebuild the index from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving DPR’s challenges with ColBERT
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/stanford-futuredata/ColBERT" rel="noopener noreferrer"&gt;ColBERT&lt;/a&gt; is a new way of scoring passage relevance using a &lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)?ref=hackernoon.com" rel="noopener noreferrer"&gt;BERT&lt;/a&gt; language model that substantially solves the problems with DPR. This diagram from the &lt;a href="https://arxiv.org/abs/2004.12832?" rel="noopener noreferrer"&gt;first ColBERT paper&lt;/a&gt; shows why it’s so exciting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7olpn54jldp7g847bdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7olpn54jldp7g847bdc.png" alt="Image description" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This compares the performance of ColBERT with other state-of-the-art solutions on the MS-MARCO dataset. (MS-MARCO is a set of Bing queries for which Microsoft scored the most relevant passages by hand. It’s one of the better retrieval benchmarks.) Lower and to the right is better.&lt;/p&gt;

&lt;p&gt;In short, ColBERT handily outperforms a field of mostly far more complex solutions, at the cost of a small increase in latency.&lt;/p&gt;

&lt;p&gt;To test this, I created &lt;a href="https://github.com/jbellis/colbert-astra" rel="noopener noreferrer"&gt;a demo&lt;/a&gt; and indexed over 1,000 Wikipedia articles with both ada002 DPR and ColBERT. I found that ColBERT delivers significantly better results on unusual search terms.&lt;/p&gt;

&lt;p&gt;The following screenshot shows that DPR fails to recognize the unusual name of William H. Herndon, an associate of Abraham Lincoln, while ColBERT finds the reference in the Springfield article. Also note that ColBERT’s No. 2 result is for a different William, while none of DPR’s results are relevant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasmfv0jayuuxjjlejqlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasmfv0jayuuxjjlejqlj.png" alt="Image description" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ColBERT is often described in dense machine learning jargon, but it’s actually very straightforward. I’ll show how to implement ColBERT retrieval and scoring on &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=RAG&amp;amp;utm_term=all-plays&amp;amp;utm_content=colbert" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; with only a few lines of Python and &lt;a href="https://thenewstack.io/boost-cassandra-data-models-with-storage-attached-indexing/" rel="noopener noreferrer"&gt;Cassandra Query Language&lt;/a&gt; (CQL).&lt;/p&gt;

&lt;h2&gt;
  
  
  The big idea
&lt;/h2&gt;

&lt;p&gt;Instead of traditional DPR, which turns each passage into a single “embedding” vector, ColBERT generates a contextually influenced &lt;a href="https://thenewstack.io/vector-primer-understand-the-lingua-franca-of-generative-ai/" rel="noopener noreferrer"&gt;vector&lt;/a&gt; for each token in the passage. ColBERT similarly generates vectors for each token in the query.&lt;/p&gt;

&lt;p&gt;(Tokenization refers to breaking up input into fractions of words before processing by an LLM. Andrej Karpathy, a founding member of the OpenAI team, &lt;a href="https://twitter.com/karpathy/status/1759996549109776702" rel="noopener noreferrer"&gt;just released an outstanding video on how this works&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Then, the score of each document is the sum of the maximum similarity of each query embedding to any of the document embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def maxsim(qv, document_embeddings):
    return max(qv @ dv for dv in document_embeddings)

def score(query_embeddings, document_embeddings):
    return sum(maxsim(qv, document_embeddings) for qv in query_embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(@ is the Python matrix-multiplication operator; on 1-D PyTorch tensors it computes the dot product, the most common measure of &lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity?ref=hackernoon.com" rel="noopener noreferrer"&gt;vector similarity&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;That’s it — you can implement ColBERT scoring in four lines of Python! Now you understand ColBERT better than 99% of the people posting about it on X (formerly known as Twitter).&lt;/p&gt;
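&lt;p&gt;To make those four lines concrete, here’s a toy run with hand-built vectors (hypothetical numbers; plain Python lists with an explicit &lt;code&gt;dot&lt;/code&gt; helper standing in for PyTorch’s @ operator so it runs anywhere):&lt;/p&gt;

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(qv, document_embeddings):
    # best match for this query token across all document tokens
    return max(dot(qv, dv) for dv in document_embeddings)

def score(query_embeddings, document_embeddings):
    # sum each query token's best match over the document
    return sum(maxsim(qv, document_embeddings) for qv in query_embeddings)

query = [[1.0, 0.0, 0.0, 0.0],   # query token 1
         [0.0, 1.0, 0.0, 0.0]]   # query token 2
doc   = [[0.9, 0.1, 0.0, 0.0],   # best match for query token 1 (0.9)
         [0.0, 0.8, 0.2, 0.0],   # best match for query token 2 (0.8)
         [0.0, 0.0, 0.0, 1.0]]   # matched well by neither query token
# score(query, doc) is 0.9 + 0.8
```

&lt;p&gt;Note that the unmatched document token contributes nothing: each query token only counts its single best match, which is what makes ColBERT robust to irrelevant surrounding text.&lt;/p&gt;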

&lt;p&gt;The rest of the ColBERT papers deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How do you fine-tune the BERT model to generate the best embeddings for a given data set?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do you limit the set of documents for which you compute the (relatively expensive) score shown here?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first question is optional and out of scope for this writeup. I’ll use the pretrained ColBERT checkpoint. But the second is straightforward to do with a vector database like DataStax Astra DB.&lt;/p&gt;

&lt;h2&gt;
  
  
  ColBERT on Astra DB
&lt;/h2&gt;

&lt;p&gt;There is a popular Python all-in-one library for ColBERT called &lt;a href="https://ben.clavie.eu/ragatouille/?ref=hackernoon.com" rel="noopener noreferrer"&gt;RAGatouille&lt;/a&gt;; however, it assumes a static dataset. One of the powerful features of RAG applications is responding to dynamically &lt;a href="https://thenewstack.io/making-real-time-data-real-change-data-capture-for-astra-db/?ref=hackernoon.com" rel="noopener noreferrer"&gt;changing data in real time&lt;/a&gt;. So instead, I’m going to use Astra’s vector index to narrow the set of documents I need to score down to the best candidates for each query token vector.&lt;/p&gt;

&lt;p&gt;There are two steps when adding ColBERT to a RAG application: ingestion and retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt;&lt;br&gt;
Because each document chunk will have multiple embeddings associated with it, I’ll need two tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE chunks (
    title text,
    part int,
    body text,
    PRIMARY KEY (title, part)
);

CREATE TABLE colbert_embeddings (
    title text,
    part int,
    embedding_id int,
    bert_embedding vector&amp;lt;float, 128&amp;gt;,
    PRIMARY KEY (title, part, embedding_id)
);

CREATE INDEX colbert_ann ON colbert_embeddings(bert_embedding)
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installing the ColBERT library (&lt;code&gt;pip install colbert-ai&lt;/code&gt;) and downloading the pretrained BERT checkpoint, I can load documents into these tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from colbert.infra.config import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint
from colbert.indexing.collection_encoder import CollectionEncoder

from cassandra.concurrent import execute_concurrent_with_args
from db import DB


def encode_and_save(title, passages):
    db = DB()
    cf = ColBERTConfig(checkpoint='checkpoints/colbertv2.0')
    cp = Checkpoint(cf.checkpoint, colbert_config=cf)
    encoder = CollectionEncoder(cf, cp)

    # encode_passages returns a flat list of embeddings and a list of how many correspond to each passage
    embeddings_flat, counts = encoder.encode_passages(passages)

    # split up embeddings_flat into a nested list
    start_indices = [0] + list(itertools.accumulate(counts[:-1]))
    embeddings_by_part = [embeddings_flat[start:start+count] for start, count in zip(start_indices, counts)]

    # insert into the database
    for part, embeddings in enumerate(embeddings_by_part):
        execute_concurrent_with_args(db.session,
                                     db.insert_colbert_stmt,
                                     [(title, part, i, e) for i, e in enumerate(embeddings)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(I like to encapsulate my DB logic in a dedicated module; you can access the full source in my &lt;a href="https://github.com/jbellis/colbert-astra/blob/master/db.py" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;br&gt;
Then retrieval looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def retrieve_colbert(query):
    db = DB()
    cf = ColBERTConfig(checkpoint='checkpoints/colbertv2.0')
    cp = Checkpoint(cf.checkpoint, colbert_config=cf)
    encode = lambda q: cp.queryFromText([q])[0]

    query_encodings = encode(query)
    # find the most relevant documents for each query embedding. using a set
    # handles duplicates so we don't retrieve the same one more than once
    docparts = set()
    for qv in query_encodings:
        rows = db.session.execute(db.query_colbert_ann_stmt, [list(qv)])
        docparts.update((row.title, row.part) for row in rows)
    # retrieve these relevant documents and score each one
    scores = {}
    for title, part in docparts:
        rows = db.session.execute(db.query_colbert_parts_stmt, [title, part])
        embeddings_for_part = [tensor(row.bert_embedding) for row in rows]
        scores[(title, part)] = score(query_encodings, embeddings_for_part)
    # return the source chunk for the top 5
    return sorted(scores, key=scores.get, reverse=True)[:5]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the query being executed for the most-relevant-documents part (&lt;code&gt;db.query_colbert_ann_stmt&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title, part
FROM colbert_embeddings
ORDER BY bert_embedding ANN OF ?
LIMIT 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Beyond the basics: RAGStack
&lt;/h2&gt;

&lt;p&gt;This article and the linked repository briefly introduce how ColBERT works. You can implement this today with your own data and see immediate results. As with everything in AI, best practices are changing daily, and new techniques are constantly emerging.&lt;/p&gt;

&lt;p&gt;To make keeping up with the state of the art easier, DataStax is rolling this and other enhancements into &lt;a href="https://www.datastax.com/products/ragstack?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=RAG&amp;amp;utm_term=all-plays&amp;amp;utm_content=colbert" rel="noopener noreferrer"&gt;RAGStack&lt;/a&gt;, our production-ready RAG library leveraging LangChain and LlamaIndex. Our goal is to provide developers with a consistent library for RAG applications that lets them adopt new functionality on their own schedule. Instead of having to keep up with the myriad changes in techniques and libraries, you follow a single stream, so you can focus on building your application. You can use RAGStack today to incorporate best practices for LangChain and LlamaIndex out of the box; advances like ColBERT will come to RAGStack in upcoming releases.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why Pulsar Beats Kafka for a Scalable, Distributed Data Architecture</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Fri, 22 Jul 2022 03:30:22 +0000</pubDate>
      <link>https://dev.to/datastax/why-pulsar-beats-kafka-for-a-scalable-distributed-data-architecture-5cjp</link>
      <guid>https://dev.to/datastax/why-pulsar-beats-kafka-for-a-scalable-distributed-data-architecture-5cjp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53jpj5utcsw9k312mz7w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53jpj5utcsw9k312mz7w.jpg" alt="Image description" width="350" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;© Shutterstock / Jurik Peter&lt;/p&gt;

&lt;p&gt;The leading open source event streaming platforms are Apache Kafka and Apache Pulsar. For enterprise architects and application developers, choosing the right event streaming approach is critical, as these technologies will help their apps scale up around data to support operations in production.&lt;/p&gt;

&lt;p&gt;Everyone wants results faster. We want applications that know what we want, even before we know ourselves. We want systems that constantly check for fraud or security issues to protect our data. We want applications that are smart enough to react and change plans when faced with the unexpected. And we want those services to be continuously available.&lt;/p&gt;

&lt;p&gt;These data-centric applications combine and use data to produce the right results. Event streaming is a key element in building these applications. Event streaming allows applications to take events – a customer action, a sensor log file, a transaction taking place – and check them against specific criteria. If they match, the event is sent on and triggers an action. For modern applications based on microservices, event streaming integrates different services with each other by acting as a message bus, and can be used to trigger those services to carry out processes or take an action.&lt;/p&gt;

&lt;p&gt;Implementing this in the right way is important. IDC estimates that companies will spend $8.5 billion on event streaming annually by 2024. Open source infrastructure will play an essential role in this.&lt;/p&gt;

&lt;p&gt;Apache Pulsar is the right choice to meet today’s developer criteria across two important trends: developers want to make more use of cloud and microservice-based architectures to develop their applications, and they don’t want to be locked into proprietary APIs and services.&lt;/p&gt;

&lt;p&gt;SEE ALSO: &lt;a href="https://jaxenter.com/cloud-native-cassandra-172909.html" rel="noopener noreferrer"&gt;Moving to cloud-native applications and data with Kubernetes and Apache Cassandra&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Microservices and Pulsar&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When you put together applications based on a microservices model, you decouple all the components that make up the service and have them communicate with each other through messages conforming to well-defined APIs. Each component will then create and manage its own data based on the activities and requirements it supports.&lt;/p&gt;

&lt;p&gt;Cloud databases like &lt;a href="https://astra.dev/3IU6Cet" rel="noopener noreferrer"&gt;DataStax Astra&lt;/a&gt; or Amazon DynamoDB are a great fit for microservices-based applications because it’s so easy to provision dozens or hundreds of databases that each microservice can use independently of the others. There are no DBAs to become bottlenecks, and no quality of service problems from sharing a single database instance.&lt;/p&gt;

&lt;p&gt;Astra is unique in offering built-in support for replication across multiple regions, allowing both a better user experience (data is closer to users) and improved reliability (even in the face of &lt;a href="https://www.techrepublic.com/article/aws-outage-how-netflix-weathered-the-storm-by-preparing-for-the-worst/" rel="noopener noreferrer"&gt;outages that take down an entire cloud region&lt;/a&gt;). This was a straightforward extension of the same properties in Apache Cassandra that Astra is based on.&lt;/p&gt;

&lt;p&gt;But besides the database, microservices-based applications need a communication layer to route messages between services. Apache Kafka is often used for this purpose today, but Kafka was developed to run in a single region and does not offer built-in, cross-datacenter replication. This is one of the problems that Apache Pulsar was created to solve as an alternative to Kafka.&lt;/p&gt;

&lt;p&gt;Geo-replication was just one improvement resulting from the more general architectural advance that Pulsar made by separating compute and storage. This change at the core of Pulsar allows it to scale more elastically than Kafka as well as to lower costs with tiered storage, where older messages are stored in an object store like HDFS or Amazon S3.&lt;/p&gt;

&lt;p&gt;Apache Pulsar is also a superior choice for microservice architectures because of its first-class support for multi-tenancy — allowing multiple services to easily share Pulsar infrastructure, even across different lines of business, while consistently enforcing data retention and security policies. Multi-tenancy is very useful for service providers because it allows them to run the same streaming data platform for multiple customers.&lt;/p&gt;

&lt;p&gt;Multi-tenancy is also growing in importance for single organizations, where different units or departments need a level of security and privacy for their customers’ data. Consider the example of a bank: each financial product team wants to manage access and services around customer data, but they won’t want to implement their own complete event streaming implementations. Instead, each team can have their data as part of that multi-tenant environment.&lt;/p&gt;

&lt;p&gt;Adding multi-tenancy support to infrastructure software after the fact is incredibly hard. Kafka doesn’t supply this capability; it was designed to run as a single-tenant service rather than to be multi-tenant. Pulsar, on the other hand, was developed to support multi-tenant deployments from the start and as part of the open source version. The alternative is to stand up a separate streaming deployment for every use case, which quickly grows much more expensive as well as more difficult to manage consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Pulsar fits into the open source mindset&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Software developers today prefer to work with open source. Open source makes it easier for developers to look at their components and use the right ones for their projects. Using a modular, flexible, open architecture not only enables the right mix of best-of-breed tools as the business – and the technology – evolves; it also simplifies the ability to scale.&lt;/p&gt;

&lt;p&gt;By taking a fully open source approach, developers can support their business goals more easily. In fact, companies using an open source software data stack are two times more likely to attribute more than 20 percent of their revenue to data and analytics, according to a recent &lt;a href="https://www.datastax.com/resources/report/the-state-of-the-data-race-2021" rel="noopener noreferrer"&gt;research report&lt;/a&gt; by DataStax.&lt;/p&gt;

&lt;p&gt;When your developers have the option of using open source projects, they will pick the project that they think is best. This can make it a challenge to maintain consistency and cohesiveness in your data stack. Without some consistency of approach, managing the implementation will get harder as you scale. Building on the same set of platforms that carry out their work in the same way can lessen the overhead.&lt;/p&gt;

&lt;p&gt;As an example, event streaming features often serve users and systems that are geographically dispersed, so it’s critical that streaming capabilities provide performance, replication, and resiliency across disparate geographies and clouds. Other elements of the application will also have to deliver those same capabilities. As a database, Apache Cassandra is known for excelling at running across multiple geographies, replicating data, and being resilient. Pairing the power and scalability of this open source NoSQL database with a truly distributed, high-scale streaming technology like Pulsar creates a complete open source data stack that can support the full set of stateful infrastructure needs in microservice architectures.&lt;/p&gt;

&lt;p&gt;Pulsar also fits into a broader approach to open source infrastructure that developers and architects will support involving Kubernetes. As a container orchestration platform, Kubernetes manages how applications scale based on demand and it can restart components if they fail. It abstracts the work of managing individual components and lets developers concentrate on how their applications will meet specific use cases. Pulsar supports deployment in Kubernetes alongside other applications, so that you can manage all your infrastructure from one tool.&lt;/p&gt;

&lt;p&gt;SEE ALSO: &lt;a href="https://jaxenter.com/apache-cassandra-iot-174970.html" rel="noopener noreferrer"&gt;Five Data Models for IoT: Managing the Latest IoT Events Based on a State in Apache Cassandra&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pulsar’s role&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Companies want to support their customers, and today’s customers expect their applications to deliver results instantly. Companies that put the right infrastructure in place to enable that immediacy will unlock their development teams’ innovation and grow their businesses.&lt;/p&gt;

&lt;p&gt;In today’s software development landscape, the ability to use open source components to handle data is a given. However, to meet the next challenge around scaling out applications around data, those open source components have to be part of a coherent, consistent stack. This open data stack should make it easier to support microservice applications in production from a data perspective, scaling to support thousands or millions of customers concurrently. Event streaming will be what connects these microservice applications together, and Pulsar has the best design approach to support how those applications will grow and scale over time.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Cassandra 4.0: Taming Tail Latencies with Java 16 ZGC</title>
      <dc:creator>Jonathan Ellis</dc:creator>
      <pubDate>Thu, 24 Jun 2021 20:02:22 +0000</pubDate>
      <link>https://dev.to/datastax/apache-cassandra-4-0-taming-tail-latencies-with-java-16-zgc-1jfa</link>
      <guid>https://dev.to/datastax/apache-cassandra-4-0-taming-tail-latencies-with-java-16-zgc-1jfa</guid>
      <description>&lt;p&gt;Like so many others in the Apache Cassandra community, I’m extremely excited to see that the 4.0 release is finally here. There are &lt;a href="https://cassandra.apache.org/doc/latest/new/" rel="noopener noreferrer"&gt;many, many improvements to Cassandra 4.0&lt;/a&gt;. One enhancement that is more important than it might look is the addition of support for Java versions 9 and up. This was not trivial, because Java 9 made changes to some internal APIs that the most performance-oriented Java projects like Cassandra relied on (you can read more about this &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-9608" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This is a big deal because with Cassandra 4.0, you not only get the direct improvements to performance added by the Apache Cassandra committers, you also unlock the ability to take advantage of seven years of improvements in the JVM (Java Virtual Machine) itself.&lt;/p&gt;

&lt;p&gt;Here, I’d like to focus on improvements in Java garbage collection that Cassandra 4.0 coupled with Java 16 offers over Cassandra 3.11 on Java 8.&lt;/p&gt;

&lt;p&gt;SEE ALSO: &lt;a href="https://jaxenter.com/cloud-native-cassandra-172909.html" rel="noopener noreferrer"&gt;Moving to cloud-native applications and data with Kubernetes and Apache Cassandra&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The garbage collection challenge&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2012, I gave a talk titled, “Dealing with JVM Limitations in Apache Cassandra.” Here is the first slide from that presentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c22vfe0gc202bmkas5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c22vfe0gc202bmkas5e.png" alt="image" width="768" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the one hand, garbage collection is a primary reason that Java is so much more productive than traditional systems languages like C++. As JVM architect Cliff Click once wrote, “Many concurrent algorithms are very easy to write with a GC and totally hard to downright impossible using explicit free.” Cassandra takes full advantage of this power.&lt;/p&gt;

&lt;p&gt;But performing garbage collection means having to briefly pause the JVM to determine which objects are no longer in use and can safely be disposed of. These GC pauses can cause delayed response times to client requests, i.e., increased latencies.&lt;/p&gt;

&lt;p&gt;Not all requests are affected by this: only the handful of requests that are in flight while Cassandra’s request-handling threads are paused for the GC. The performance impact is thus visible only in tail latencies, that is, the 99th or 99.9th percentile measurements, corresponding to the slowest 1% or 0.1% of requests.&lt;/p&gt;
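&lt;p&gt;To make the percentile terminology concrete, here is a small sketch (with made-up latency numbers, not data from these benchmarks) showing why pauses that hit only a small fraction of requests are invisible at the median but dominate the tail:&lt;/p&gt;

```shell
# Nearest-rank percentile over a file of per-request latencies
# (one ms value per line). Data and file path are hypothetical.
percentile() {  # usage: percentile <p> <file>
  sort -n "$2" | awk -v p="$1" '
    { a[NR] = $0 }
    END { idx = int(NR * p / 100); if (idx < 1) idx = 1; print a[idx] }'
}

# 1000 samples: 980 fast requests at 3 ms, 20 requests (2%) that
# landed on a GC-style 250 ms pause.
{ for i in $(seq 980); do echo 3; done
  for i in $(seq 20);  do echo 250; done; } > /tmp/latencies.txt

percentile 50 /tmp/latencies.txt   # median: 3 -- the pauses are invisible here
percentile 99 /tmp/latencies.txt   # p99: 250 -- the tail exposes them
```

&lt;p&gt;The same data yields a healthy-looking median and an ugly p99, which is why GC pauses show up in tail-latency measurements rather than in averages.&lt;/p&gt;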

&lt;p&gt;As with so many things, optimizing GC involves tradeoffs, and the original Java GC designs focused more on improving throughput than on reducing pause times. Fast forward to 2021 and we have common server-class CPUs with 64 cores/128 threads—we have plenty of throughput on tap. It’s time to spend some of those cycles on lower pause times.&lt;/p&gt;

&lt;p&gt;The Z Garbage Collector (ZGC) was created to address this situation, and specifically to guarantee pause times under 10ms. ZGC was added to Java 11 as an experimental feature, &lt;a href="https://openjdk.java.net/jeps/377" rel="noopener noreferrer"&gt;promoted to production in Java 15&lt;/a&gt;, and &lt;a href="https://malloc.se/blog/zgc-jdk16" rel="noopener noreferrer"&gt;further improved&lt;/a&gt; in Java 16.&lt;/p&gt;
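&lt;p&gt;Enabling ZGC is a one-flag change. As a sketch (flag names are from the OpenJDK documentation linked above, not the exact options files used in these tests):&lt;/p&gt;

```shell
# JVM options fragment -- a sketch, not the exact test configuration.
# JDK 15 and later (ZGC production-ready): a single flag enables it.
-XX:+UseZGC
# JDK 11-14 (ZGC still experimental): it must also be unlocked:
#   -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
```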

&lt;p&gt;To show how well ZGC improves Cassandra performance, we compared throughput and latency in three environments: Cassandra 3.11 running on JDK 8 with its default CMS GC settings, Cassandra 4.0 running on JDK 8 with the same settings, and Cassandra 4.0 running on JDK 16 with ZGC. I’m pleased to report that ZGC convincingly achieves its design goals, allowing Cassandra to deliver nearly constant latencies through the 99th percentile, with only a modest uptick at the 99.9th.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ZGC performance results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;My colleague Jonathan Shook benchmarked the performance characteristics of Cassandra 3.11 and 4.0 in detail across three workloads: simple key/value, a time series workload with many rows per partition, and a tabular workload with one row per partition but many columns per row.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Throughput results&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffee74bm33kvse8rwa9h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffee74bm33kvse8rwa9h9.png" alt="image" width="512" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we are looking at Cassandra running at 70% of maximum throughput. This leaves 30% of operational headroom to absorb compaction, repair, and load spikes, which keeps the measurements realistic.&lt;/p&gt;

&lt;p&gt;Cassandra 4.0 running with the same configuration as Cassandra 3.11 is 30% faster in the key/value workload, 2% slower in the time series workload, and 10% faster in the tabular workload. Turning on ZGC unlocks an additional 30% of throughput for the key/value and time series workloads, but has no effect on the tabular workload.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Latency results&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;I’ve split the latency results into one chart per workload so it’s easier to see the trends across the different percentiles:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69dfjs3p768b0a5b25gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69dfjs3p768b0a5b25gk.png" alt="image" width="512" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bn8hvwt4pa3m18pe6rk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bn8hvwt4pa3m18pe6rk.png" alt="image" width="768" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxjbh22eybvg0jr17i53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxjbh22eybvg0jr17i53.png" alt="image" width="512" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For these results, we limited each test scenario to the slowest system’s throughput, i.e., we used 30,000, 44,000, and 54,000 requests per second for the key/value, time series, and tabular workloads, respectively.&lt;/p&gt;

&lt;p&gt;Cassandra 4.0’s latencies are virtually identical to 3.11’s with the same GC settings, but ZGC is consistently better, by as much as a factor of 5 to 10 at p99 and p999.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The NoSQLBench performance testing suite&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most benchmarks of non-relational databases are done with either product-specific tooling (like &lt;a href="https://cassandra.apache.org/doc/latest/tools/cassandra_stress.html" rel="noopener noreferrer"&gt;cassandra-stress&lt;/a&gt;), or with &lt;a href="https://github.com/brianfrankcooper/YCSB" rel="noopener noreferrer"&gt;YCSB&lt;/a&gt;, which gives you a lowest-common-denominator key-value workload across dozens of systems.&lt;/p&gt;

&lt;p&gt;Jonathan Shook created &lt;a href="https://github.com/nosqlbench/nosqlbench" rel="noopener noreferrer"&gt;NoSQLBench&lt;/a&gt; to be a cross-platform performance testing tool that is easier to use than cassandra-stress and (much) more powerful than YCSB. Its scripting layer can support things that no other testing tool can, with particular emphasis on modeling complex workloads with fidelity and on simulating realistic scenarios such as load spikes. As its name suggests, NoSQLBench is not Cassandra-specific and welcomes contributions from everyone; today there are clients for Cassandra, CockroachDB, JDBC, and MongoDB, as well as the non-database products Kafka and Pulsar. If you’re serious about performance testing in 2021, you should check out NoSQLBench. You can get started at &lt;a href="https://github.com/nosqlbench/nosqlbench" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Other useful links: &lt;a href="https://github.com/nosqlbench/nosqlbench/releases" rel="noopener noreferrer"&gt;releases&lt;/a&gt;, &lt;a href="https://discord.gg/dBHRakusMN" rel="noopener noreferrer"&gt;discord&lt;/a&gt;, &lt;a href="http://docs.nosqlbench.io/#/docs/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The NoSQLBench workload descriptions for the tests in this post can be found &lt;a href="https://github.com/nosqlbench/nosqlbench/tree/main/driver-cql-shaded/src/main/resources/activities/baselinesv2" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
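&lt;p&gt;As an illustration of what running one of these workloads looks like (the host name and cycle count below are placeholders, not the values used in these tests):&lt;/p&gt;

```shell
# Hypothetical NoSQLBench invocation for the key/value workload:
# create the schema, then run the main phase against a target cluster.
./nb run driver=cql workload=cql-keyvalue tags=phase:schema host=cassandra-host
./nb run driver=cql workload=cql-keyvalue tags=phase:main host=cassandra-host \
     cycles=10M threads=auto --progress console:1s
```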

&lt;p&gt;SEE ALSO: &lt;a href="https://jaxenter.com/java-application-process-id-174340.html" rel="noopener noreferrer"&gt;Quickly find your Java application process ID&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Without switching to ZGC, Cassandra 4.0 offers modest but real throughput improvements for key/value and tabular workloads.&lt;/p&gt;

&lt;p&gt;Combining Cassandra 4.0 with ZGC on Java 16 delivers further throughput improvements for the key/value and time series workloads, and convincingly demonstrates ZGC’s design goal of making GC pause times a non-issue across all tested workloads.&lt;/p&gt;

&lt;p&gt;ZGC is production-ready starting with Java 15; for enterprises that want to stick with LTS releases, ZGC will be one of the headlining reasons to upgrade to the Java 17 LTS release later this year. ZGC is one of the most significant performance “free lunches” available, and it Just Works: the results shown here use ZGC out of the box, with no extra tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Appendix: Test environment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All tests were run on the same physical cluster of AWS i3.4xl nodes: 16 vCPUs, 122GB RAM, 10Gb network, 5 nodes in the cluster. Storage was configured as XFS on direct NVMe, single volume. All data was stored at RF3. Assigned tokens were used to ensure consistent data distribution across the tested versions. Consistency level for all operations was set as LOCAL_QUORUM. Concurrency from the client side was set at 960 (20x client cores) for the keyvalue test, and 480 (10x client cores) for the time-series and tabular tests. All measurements were taken from the client, and include duration between submitting and fully reading any data in results. All measurements were taken with 3 significant digits of precision, then rounded to the nearest ms. ZGC was configured with &lt;a href="https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingZGC" rel="noopener noreferrer"&gt;basic recommended settings&lt;/a&gt;: 16GB min heap, 64GB max heap, large pages enabled. The other numbers are using Cassandra’s out-of-the-box configuration with CMS.&lt;/p&gt;
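&lt;p&gt;For reference, the ZGC settings described above translate to a JVM options fragment along these lines (the exact file placement is an assumption; the values are the ones from the appendix):&lt;/p&gt;

```shell
-Xms16G               # 16GB minimum heap
-Xmx64G               # 64GB maximum heap
-XX:+UseZGC
-XX:+UseLargePages    # large pages enabled
```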

</description>
      <category>java</category>
      <category>database</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
