<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefano Lottini</title>
    <description>The latest articles on DEV Community by Stefano Lottini (@hemidactylus).</description>
    <link>https://dev.to/hemidactylus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F645785%2F003ff5da-6a7c-4c59-a414-1d500ed13b10.png</url>
      <title>DEV Community: Stefano Lottini</title>
      <link>https://dev.to/hemidactylus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hemidactylus"/>
    <language>en</language>
    <item>
      <title>The Subtleties of Vector Similarity Scales (part 4)</title>
      <dc:creator>Stefano Lottini</dc:creator>
      <pubDate>Mon, 04 Mar 2024 22:35:39 +0000</pubDate>
      <link>https://dev.to/datastax/the-subtleties-of-vector-similarity-scales-part-4-2hjd</link>
      <guid>https://dev.to/datastax/the-subtleties-of-vector-similarity-scales-part-4-2hjd</guid>
      <description>&lt;p&gt;Here’s the fourth and final installment of this series on learnings from building a vector database. We covered the &lt;a href="https://dev.to/datastax/what-we-learned-when-we-built-a-vector-database-and-our-customers-started-using-it-part-1-1c6h"&gt;basic notions of vectors and interacting with them&lt;/a&gt;, &lt;a href="https://dev.to/datastax/vector-similarities-how-similar-are-they-part-2-37g2"&gt;the behavior of vector similarities&lt;/a&gt;, and &lt;a href="https://dev.to/datastax/using-vectors-with-apache-cassandra-and-datastax-astra-db-8lb"&gt;their usage with Apache Cassandra and DataStax Astra DB&lt;/a&gt;. Here, we’ll explore the pitfalls associated with rescaling similarities, and bring it to life with an end-to-end migration example.&lt;/p&gt;

&lt;p&gt;In this series (as is the case for most applications out there), we’ve preferred to work with "similarities" rather than "distances." This is because the former lend themselves more naturally to being cast as a "score" bounded between known values.&lt;/p&gt;

&lt;p&gt;For most applications, knowing that zero means "least similar" and one means "most similar" is all that counts. However, a few points should be kept in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The choice of scaling this score between zero and one is just a matter of convenience; there is nothing special about it, except the fact that it "feels natural." And, sure enough, this is what Apache Cassandra and &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vectors&amp;amp;utm_term=all-plays&amp;amp;utm_content=what-we-learned" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; do, as can be checked by looking at the definitions given in &lt;a href="https://dev.to/datastax/what-we-learned-when-we-built-a-vector-database-and-our-customers-started-using-it-part-1-1c6h"&gt;part 1&lt;/a&gt;. These achieve a final result bound to lie between zero and one, albeit through very different formulae for the Cosine and the Euclidean cases.&lt;/p&gt;
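
&lt;p&gt;&lt;em&gt;As a quick refresher, the two [0:1]-bounded definitions can be sketched in a few lines of plain Python (illustrative helper names; not library code):&lt;/em&gt;&lt;/p&gt;

```python
import math

def cosine_similarity(v1, v2):
    # Scaled cosine similarity: (1 + cos(angle)) / 2, bounded in [0, 1].
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return (1 + dot / norms) / 2

def euclidean_similarity(v1, v2):
    # Scaled Euclidean similarity: 1 / (1 + squared distance), bounded in (0, 1].
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return 1 / (1 + sq_dist)

print(cosine_similarity([1, 0], [-1, 0]))    # 0.0  (opposite directions)
print(euclidean_similarity([1, 0], [1, 0]))  # 1.0  (coinciding vectors)
```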

&lt;p&gt;&lt;strong&gt;Alternate cosine similarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with the cosine similarity, however, it is important to note that another, different scale is very common in textbooks and references (such as &lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity#Definition" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;). Especially in more mathematically oriented applications, one often prefers the following definition (denoted with a superscript star in this writeup):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgary5m8eq6u9d0gv9ejm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgary5m8eq6u9d0gv9ejm.png" alt="Definition of the " width="484" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This definition is such that identically-oriented vectors still have &lt;strong&gt;S*cos = 1&lt;/strong&gt;, while exactly opposed vectors yield &lt;strong&gt;S*cos = -1&lt;/strong&gt;. In other words, the two cosine similarities are related by the simple linear rescaling &lt;strong&gt;Scos = (1+S*cos)/2&lt;/strong&gt;.&lt;/p&gt;
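
&lt;p&gt;&lt;em&gt;The rescaling is trivial to apply in code; a minimal sketch (the helper name is our own):&lt;/em&gt;&lt;/p&gt;

```python
def to_scos(s_star):
    # Linear rescaling from the [-1, 1] textbook convention to the
    # [0, 1] convention used in this series: Scos = (1 + S*cos) / 2.
    return (1 + s_star) / 2

print(to_scos(1.0))   # 1.0  (identical directions)
print(to_scos(0.0))   # 0.5  (orthogonal vectors)
print(to_scos(-1.0))  # 0.0  (exactly opposed)
```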

&lt;p&gt;&lt;strong&gt;The meaning of the scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At this point it is clear that the &lt;em&gt;numeric values&lt;/em&gt; of similarities have no intrinsic meaning by themselves. They are very useful to anchor comparisons, such as when determining – and then applying – a cutoff threshold, but not much more. Stated differently, just knowing that "vectors &lt;strong&gt;v1&lt;/strong&gt; and &lt;strong&gt;v2&lt;/strong&gt; have a similarity of 0.8" is of little importance without a comparison context. This is even more true across measures: a 0.8 with Euclidean, for example, has nothing to do with a 0.8 from cosine (earlier I gave an explicit function to translate values, but that holds on the sphere only).&lt;/p&gt;

&lt;p&gt;Mathematically speaking, one could have chosen any of infinitely many ways to construct a well-behaved "similarity function" of two vectors; while there is no strong formal principle to favor one over another, all these candidate similarities may well yield different numeric values for the same pair of vectors. This is the reason for the claim that similarity &lt;em&gt;values&lt;/em&gt; are an arbitrary, conventional notion.&lt;/p&gt;
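
&lt;p&gt;&lt;em&gt;To see this arbitrariness in action, here are two hypothetical, equally well-behaved candidates (both made up for illustration): each maps opposed directions to 0 and identical directions to 1, and each is monotonic, yet they disagree on every intermediate value:&lt;/em&gt;&lt;/p&gt;

```python
import math

def sim_linear(cos_theta):
    # One valid choice: a linear rescaling of the cosine.
    return (1 + cos_theta) / 2

def sim_angle(cos_theta):
    # Another valid choice: based on the angle itself.
    return 1 - math.acos(cos_theta) / math.pi

# Same pair of vectors (cosine 0.5), two different "similarity values":
print(sim_linear(0.5))  # 0.75
print(sim_angle(0.5))   # about 0.667
```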

&lt;p&gt;&lt;strong&gt;Intermezzo on vector embeddings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The special case of vectors from embedding models comes with its own problems and caveats, and is not the main focus of this article. Yet two things are worth mentioning here. First, the same two sentences will have &lt;em&gt;different values&lt;/em&gt; for the similarity under different models (even when using the Cosine similarity throughout!). Second, one should not expect by any means that "extremely different sentences" will result in vectors with zero similarity. The latter is a somewhat common misconception, possibly fueled by an erroneous interpretation of this score as a "semantic-relatedness percentage." The truth is, with most embedding models one would have a hard time coming up with two sentences whose similarity (&lt;strong&gt;Scos&lt;/strong&gt;) goes below 0.75 or so. The lesson here is: &lt;em&gt;rescale your expectations accordingly&lt;/em&gt;. There'll be a follow-up article specifically targeting embeddings-related issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pesky dot, again&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I just mentioned how the various similarities are engineered to all be bound in the very handy &lt;strong&gt;[0:1]&lt;/strong&gt; interval. Well, strictly speaking, that’s a lie: the dot-product similarity is designed to be used only as a replacement for the cosine where the two coincide (i.e. on unit-norm vectors). So, once again, if you use the dot-product on arbitrary vectors (which at this point you will surely see as a weird choice anyway), &lt;em&gt;do not expect&lt;/em&gt; the similarities to be bounded in any way. In fact, as the formulae given earlier show, the dot-product similarity between arbitrary vectors can be anything from negative infinity all the way to positive infinity!&lt;/p&gt;
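
&lt;p&gt;&lt;em&gt;A few lines of Python make the point (raw dot product, no rescaling applied):&lt;/em&gt;&lt;/p&gt;

```python
def dot_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

# On unit-norm vectors the raw dot product stays within [-1, 1] ...
print(dot_product([1, 0], [0, 1]))         # 0
# ... but on arbitrary vectors it can take any value whatsoever:
print(dot_product([1000, 0], [1000, 0]))   # 1000000
print(dot_product([1000, 0], [-1000, 0]))  # -1000000
```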

&lt;p&gt;&lt;strong&gt;Similarity of one&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One must not assume that a similarity of one means coinciding vectors. This holds only for the Euclidean similarity, or for any similarity restricted to the unit sphere. The counterexample is the cosine similarity between two vectors, one a positive multiple of the other (even worse, for the dot-product off the sphere, you have seen how 1 is not a "special" value at all).&lt;/p&gt;
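
&lt;p&gt;&lt;em&gt;For instance, in plain Python (the helper follows the [0:1] cosine convention used throughout this series):&lt;/em&gt;&lt;/p&gt;

```python
import math

def scos(v1, v2):
    # Cosine similarity in the [0, 1] convention.
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return (1 + dot / norms) / 2

# Two distinct vectors, one a positive multiple of the other:
print(round(scos([1, 2], [3, 6]), 12))  # 1.0, yet the vectors do not coincide
```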

&lt;h2&gt;Case study: Migration of a vector app&lt;/h2&gt;

&lt;p&gt;One of the lessons from this (admittedly a bit theoretical) exposition is that you should always read the fine print when it comes to vector stores and the specific mathematical definitions that are used for the similarities.&lt;/p&gt;

&lt;p&gt;To illustrate this point in a practical manner, let’s look at what kind of care should be taken concerning similarities when migrating a typical vector-powered application between vector stores. Let's say you are moving from Chroma to Cassandra / Astra DB. Your application stores vectors and runs ANN search queries, possibly involving a cutoff threshold, previously determined through dedicated analysis, on the results' "scores" (whatever they are). &lt;strong&gt;Our task now is to ensure the application behaves exactly the same after the migration.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: below you'll see a detailed investigation of how Chroma behaves. It has been chosen merely as a representative example; the main point is that this level of care should be exercised when migrating vector-based workloads between any two databases!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fulfilling the stated goal requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using the same "kind of similarity" (what I called &lt;em&gt;measure&lt;/em&gt; earlier)&lt;/li&gt;
&lt;li&gt;being aware of the precise definition for the similarity (different scales and such), and correcting for any difference&lt;/li&gt;
&lt;li&gt;of course, adapting the code to use another library!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The third point is not really in scope for this illustrative example; we are most interested in the first two. Let's start!&lt;/p&gt;

&lt;p&gt;The Chroma-backed "app" you're migrating is the following Python script. It creates a vector store (with the Cosine measure), puts a few vectors in it, and runs an ANN search, printing the resulting matches, the numeric value associated with each, and whether these are "close enough to the query" (for some unspecified purpose). &lt;em&gt;All vectors are guaranteed to have unit norm.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import chromadb
chroma_client = chromadb.Client()

# Creating a Vector store
cos_coll = chroma_client.create_collection(
    name="cosine_coll",
    metadata={"hnsw:space": "cosine"},
)

# Saving vector entries
cos_coll.add(
    documents=["3 o'clock", "6 o'clock", "9 o'clock"],
    embeddings=[[1, 0], [0, -1], [-1, 0]],
    ids=["3:00", "6:00", "9:00"],
)

# Running ANN search
cos_matches = cos_coll.query(
    query_embeddings=[[1, 0]],
    n_results=3
)

chroma_threshold = 1.5

# Printing the results and their "distance"
match_ids = cos_matches["ids"][0]
match_distances = cos_matches["distances"][0]
for m_id, m_distance in zip(match_ids, match_distances):
    status = "ok" if m_distance &amp;lt;= chroma_threshold else "NO!"
    print(f"d([1,0], '{m_id}') = {m_distance}: {status}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For illustrative purposes, the script inserts two-dimensional vectors arranged as the hour hand of a clock at various times, the query vector being the "3 o'clock" right-pointing direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50kn734w8ht414md2rq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50kn734w8ht414md2rq1.png" alt="The three sample vectors in the " width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: The "clock model" illustrates the vectors used in the "sample application". The red vectors are the inserted rows, and the blue vector is the query vector used throughout.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Running the above program (as tested with &lt;code&gt;chromadb==0.4.21&lt;/code&gt;) has this output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;d([1,0], '3:00') = 0.0: ok
d([1,0], '6:00') = 1.0: ok
d([1,0], '9:00') = 2.0: NO!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do you notice anything here? Well, the number Chroma returns with the matches is &lt;em&gt;not a similarity at all&lt;/em&gt;, but rather a distance! Indeed, it &lt;em&gt;increases&lt;/em&gt; from the closest to the farthest match. This can be verified on the &lt;a href="https://docs.trychroma.com/usage-guide#changing-the-distance-function" rel="noopener noreferrer"&gt;Chroma docs page&lt;/a&gt;, where all the relevant formulae are provided. This is very useful information if one is to port an application to a different vector store!&lt;/p&gt;

&lt;p&gt;One finds out that, regardless of the measure, Chroma always works in terms of a distance, and that the Cosine choice is no exception, with a "Cosine distance" defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a8y8ub5mdvpuxa3otlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a8y8ub5mdvpuxa3otlu.png" alt="Chroma's definition of " width="689" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other words, one can relate this quantity to the familiar similarity through &lt;strong&gt;dcos&lt;sup&gt;Chroma&lt;/sup&gt;(v1, v2) = 1 - S*cos(v1, v2) = 2 - 2 Scos(v1, v2)&lt;/strong&gt;, equivalent to the inverse mapping &lt;strong&gt;Scos = 1 - dcos&lt;sup&gt;Chroma&lt;/sup&gt; / 2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But there is more in the way of translations: indeed, the inequalities in the original code have to be &lt;em&gt;reversed&lt;/em&gt; to keep their meaning. Where the Chroma code has &lt;code&gt;distance &amp;lt;= chroma_threshold&lt;/code&gt;, for example, you'll need to place a condition such as &lt;code&gt;similarity &amp;gt; cass_threshold&lt;/code&gt; in the ported code, where &lt;code&gt;cass_threshold = 1 - chroma_threshold / 2&lt;/code&gt;, following the mapping above.&lt;/p&gt;
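
&lt;p&gt;&lt;em&gt;A sketch of the translation (helper names are our own; &lt;code&gt;le&lt;/code&gt; and &lt;code&gt;gt&lt;/code&gt; are the standard-library "less or equal" / "greater than" comparisons):&lt;/em&gt;&lt;/p&gt;

```python
from operator import gt, le

chroma_threshold = 1.5
cass_threshold = 1 - chroma_threshold / 2  # = 0.25, per the mapping above

def keep_chroma(distance):
    # Chroma-side cutoff: keep a match when distance is at most the threshold.
    return le(distance, chroma_threshold)

def keep_cassandra(similarity):
    # Cassandra-side cutoff: keep a match when similarity exceeds the
    # translated threshold (note the reversed direction of the inequality).
    return gt(similarity, cass_threshold)

# The two checks agree on the example matches seen earlier:
for distance in (0.0, 1.0, 2.0):
    similarity = 1 - distance / 2
    print(distance, keep_chroma(distance), keep_cassandra(similarity))
```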

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Side note&lt;/strong&gt;: When possible, it’s better to translate thresholds rather than similarities/distances. This can be done "at coding time," generally minimizing the chance of errors/inconsistencies, and in some cases (e.g. when using higher abstractions around a vector store) might be the only feasible choice.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally, it should be noted that whereas in Chroma the default measure is Euclidean, Cassandra and Astra DB employ cosine when none is explicitly chosen: it is safer, and less prone to surprises, to always spell the measure out when creating vector stores.&lt;br&gt;
So, the "application," once migrated to Astra DB, consists of a CQL schema-creation script, looking like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Table creation (CQL)
CREATE TABLE cos_table (
  id TEXT PRIMARY KEY, my_vector VECTOR&amp;lt;FLOAT, 2&amp;gt;
);

// Vector index creation (CQL)
CREATE CUSTOM INDEX cos_table_v_index ON cos_table(my_vector)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'similarity_function': 'COSINE'};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;plus the "app" itself, the following Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Connecting to DB
from cassandra.cluster import Cluster
cluster = Cluster(...)  # connection to DB
session = cluster.connect()

# Saving vector entries
session.execute("""
    INSERT INTO cos_table (id, my_vector)
    VALUES ('3:00', [1, 0]);
""")
session.execute("""
    INSERT INTO cos_table (id, my_vector)
    VALUES ('6:00', [0, -1]);
""")
session.execute("""
    INSERT INTO cos_table (id, my_vector)
    VALUES ('9:00', [-1, 0]);
""")

# Running ANN search
ann_query = """
SELECT
  id,
  my_vector,
  similarity_cosine([1, 0], my_vector) as sim
FROM cos_table
ORDER BY my_vector ANN OF [1, 0]
LIMIT 3;"""
cos_matches = session.execute(ann_query)

chroma_threshold = 1.5
cass_threshold = 1 - chroma_threshold / 2

# Printing the results and their "similarity"
for match in cos_matches:
    # While we're at it, we recast to Chroma distance
    chroma_dist = 1 - match.sim
    #
    status = "ok" if match.sim &amp;gt; cass_threshold else "NO!"
    print(
        f"d([1,0], '{match.id}') = {match.sim}: {status} "
        f"(d_chroma = {chroma_dist})"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this, as expected, will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;d([1,0], '3:00') = 1: ok (d_chroma = 0)
d([1,0], '6:00') = 0.5: ok (d_chroma = 1)
d([1,0], '9:00') = 0: NO! (d_chroma = 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you see, one has to pay some attention to avoid getting caught in the subtleties of distances, similarities, and definitions. It's definitely better to always read the fine print and play with a toy model to check one's assumptions on known cases (such as the "clock model" vectors used above).&lt;/p&gt;

&lt;p&gt;Were the original application &lt;strong&gt;using the Euclidean measure&lt;/strong&gt; (but still working on the unit sphere), one would be in for another surprise: namely, what Chroma calls "Euclidean distance" is actually the &lt;em&gt;squared distance&lt;/em&gt;! In other words, &lt;strong&gt;deucl&lt;sup&gt;Chroma&lt;/sup&gt;(v1, v2) = δ&lt;sup&gt;2&lt;/sup&gt;eucl(v1, v2)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once this bit is acknowledged, the rest proceeds in the same manner as seen above: distances (Chroma) grow as similarities (Cassandra / Astra DB) decrease, inequalities have to be reversed, and the following mapping needs to be used: &lt;strong&gt;Seucl = 1 / (1 + deucl&lt;sup&gt;Chroma&lt;/sup&gt;)&lt;/strong&gt;, i.e. &lt;strong&gt;deucl&lt;sup&gt;Chroma&lt;/sup&gt; = (1/Seucl) - 1&lt;/strong&gt;. One consequence is that, on the sphere, the Chroma Euclidean distance ranges from zero (most similar) to &lt;em&gt;four&lt;/em&gt; (most dissimilar, i.e. antipodal vectors on the sphere).&lt;/p&gt;
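
&lt;p&gt;&lt;em&gt;In code, assuming the squared-distance convention just described (helper names are our own):&lt;/em&gt;&lt;/p&gt;

```python
def chroma_euclidean_distance(v1, v2):
    # What Chroma reports for its Euclidean space: the *squared* distance.
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

def to_seucl(d_chroma):
    # Cassandra / Astra DB Euclidean similarity: Seucl = 1 / (1 + d_chroma).
    return 1 / (1 + d_chroma)

# Antipodal unit vectors: true distance 2, squared distance 4 ...
d = chroma_euclidean_distance([1, 0], [-1, 0])
print(d)            # 4
# ... corresponding to the lowest similarity attainable on the sphere:
print(to_seucl(d))  # 0.2
```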

&lt;p&gt;The sheer number of ways to quantify the relative position of two vectors, with different stores and different similarities, is enough to make you feel a bit dizzy – the lesson here is that one should make no unwarranted assumptions and should verify definitions thoroughly. Test with known vectors, check the docs for formulae! To complete the exercise, here is a complete "translation map" between all the distances/similarities encountered in this migration example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9tm0d71qzgm30ti5muf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9tm0d71qzgm30ti5muf.png" alt="Conversion table between all similarities and " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the table above, which expresses each quantity as a function of any other, the white cells are always valid, while the darkened ones are relations that hold only on the sphere (i.e. where it makes sense to recast Euclidean notions to Cosine, and vice versa, unambiguously).&lt;/p&gt;

&lt;p&gt;You can also check the values these quantities assume with the three "clock" vector positions that were used in the example code (remember these are unit-norm vectors):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0yw57rshfalyljaul9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0yw57rshfalyljaul9c.png" alt="Image description" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded in LangChain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your original code to migrate might use a framework rather than directly accessing the Chroma primitives; for example, it might be a &lt;a href="https://python.langchain.com" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; application leveraging the &lt;code&gt;langchain.vectorstores.Chroma&lt;/code&gt; vector store abstraction. As can be verified by inspecting the plugin source code (or by running suitable test code, although this turns out to be more convoluted due to LangChain's choice of abstractions around embeddings), the LangChain object exposes essentially the same API as before, so one should specify the cosine measure by passing a dedicated parameter when creating the store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_community.vectorstores import Chroma
my_store = Chroma(
    ...,
    collection_metadata={"hnsw:space": "cosine"},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "score" returned by methods such as &lt;code&gt;similarity_search_with_score&lt;/code&gt;, likewise, is the very "distance" coming from the Chroma methods, so the same conversions seen above are required.&lt;br&gt;
Likewise, when using the &lt;code&gt;langchain.vectorstores.Cassandra&lt;/code&gt; class, the "score" will be exactly the similarity &lt;strong&gt;Seucl&lt;/strong&gt; seen earlier and bound in the &lt;strong&gt;[0:1]&lt;/strong&gt; interval.&lt;/p&gt;
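
&lt;p&gt;&lt;em&gt;A pair of hypothetical conversion helpers one might keep next to the migrated LangChain code, assuming the Cosine measure on the Chroma side and following the mappings given earlier (the function names are our own):&lt;/em&gt;&lt;/p&gt;

```python
def chroma_cosine_score_to_similarity(score):
    # LangChain's Chroma wrapper surfaces the raw Chroma cosine "distance"
    # as its score; map it back to the [0, 1]-bounded similarity.
    return 1 - score / 2

def cassandra_score_to_similarity(score):
    # The Cassandra / Astra DB wrapper's score is already the bounded
    # similarity, so it passes through unchanged.
    return score

# Antipodal unit vectors, as reported by the two stores:
print(chroma_cosine_score_to_similarity(2.0))  # 0.0
print(cassandra_score_to_similarity(0.0))      # 0.0
```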

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This technical deep dive has highlighted the definitions, the quirks, and the caveats to keep in mind when approaching the concept of similarity while querying vector stores. As you have seen, subtleties abound. Luckily, awareness of the underlying mathematical structure helps avoid fruitless pursuits and actively counterproductive choices.&lt;/p&gt;

&lt;p&gt;So, armed with all this knowledge … why not create a &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vectors&amp;amp;utm_term=all-plays&amp;amp;utm_content=what-we-learned" rel="noopener noreferrer"&gt;free account on Astra DB&lt;/a&gt; and start playing with vector search?&lt;/p&gt;

</description>
      <category>programming</category>
      <category>vectordatabase</category>
      <category>ai</category>
    </item>
    <item>
      <title>Using Vectors with Apache Cassandra and DataStax Astra DB (part 3)</title>
      <dc:creator>Stefano Lottini</dc:creator>
      <pubDate>Mon, 26 Feb 2024 15:28:38 +0000</pubDate>
      <link>https://dev.to/datastax/using-vectors-with-apache-cassandra-and-datastax-astra-db-8lb</link>
      <guid>https://dev.to/datastax/using-vectors-with-apache-cassandra-and-datastax-astra-db-8lb</guid>
      <description>&lt;p&gt;Welcome to part three of my series on learnings from building a vector database. In &lt;a href="https://dev.to/datastax/what-we-learned-when-we-built-a-vector-database-and-our-customers-started-using-it-part-1-1c6h"&gt;part one&lt;/a&gt;, we covered some of the basics about vectors. &lt;a href="https://dev.to/datastax/vector-similarities-how-similar-are-they-part-2-37g2"&gt;Part two&lt;/a&gt; was a deep dive into different similarities, on and off the unit sphere. Here, we’ll explore the use of vectors with Apache Cassandra and &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vectors&amp;amp;utm_term=all-plays&amp;amp;utm_content=what-we-learned" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Similarity search and approximate nearest neighbor&lt;/h2&gt;

&lt;p&gt;The key query one wants to run against a vector database is the "similarity search": given a certain query vector &lt;strong&gt;V&lt;/strong&gt;, retrieve the entries whose vectors have the highest similarity to &lt;strong&gt;V&lt;/strong&gt;. The next problem is how to make vector search efficient when there are potentially many millions of vectors in the database. Luckily, there are several advanced algorithms to do that within a DB engine, usually involving the so-called Approximate Nearest Neighbour (ANN) search in one way or another. In practice, this means that one trades a (very small) likelihood of missing some matches for a (substantial) gain in performance.&lt;/p&gt;

&lt;p&gt;I won't spend many words here on the topic of how to equip Cassandra with state-of-the-art ANN search: if you are curious, check out &lt;a href="https://www.datastax.com/blog/5-vector-search-challenges-and-how-we-solved-them-in-apache-cassandra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vectors&amp;amp;utm_term=all-plays&amp;amp;utm_content=what-we-learned" rel="noopener noreferrer"&gt;this excellent article&lt;/a&gt; that tells the story of how we did it! I'll just mention that, by maintaining a vector-specialized index alongside the table, Cassandra / Astra DB can offer flexible, performant and massively scalable vector search. Crucially, this index (a certain type of &lt;a href="https://cassandra.apache.org/doc/latest/cassandra/developing/cql/indexing/sai/sai-overview.html" rel="noopener noreferrer"&gt;SAI index&lt;/a&gt;, in Cassandra parlance) needs a specific choice of measure right from the start.&lt;/p&gt;

&lt;p&gt;I'll now take a closer look at how the interaction with Cassandra – or equivalently Astra DB – works. You will see vector-search queries expressed in the &lt;a href="https://docs.datastax.com/en/astra-serverless/docs/vector-search/cql.html" rel="noopener noreferrer"&gt;Cassandra Query Language&lt;/a&gt; (CQL). CQL is a powerful way to interact with the database, and vector-related workloads are no exception.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This section assumes CQL is used throughout to interact with the database (the reason is that CQL is permissive enough to let you violate the querying best practices as outlined below).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, I personally think CQL is great, but I appreciate that not everyone might have the time, the motivation, or the need to learn how to use it.&lt;/p&gt;

&lt;p&gt;For this reason, there are several higher-level abstractions that effectively "hide" the CQL foundations and allow one to focus on the vector-related application itself. I'll just mention, for Python users, &lt;a href="https://cassio.org" rel="noopener noreferrer"&gt;CassIO&lt;/a&gt;, which in turn powers LangChain's and LlamaIndex's plugins for Cassandra. For an easier-to-use and cross-language experience, take a look at the REST-based &lt;a href="https://www.datastax.com/blog/general-availability-data-api-for-enhanced-developer-experience?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vectors&amp;amp;utm_term=all-plays&amp;amp;utm_content=what-we-learned" rel="noopener noreferrer"&gt;Data API&lt;/a&gt; available in Astra DB.&lt;/p&gt;

&lt;h2&gt;Index time versus query time&lt;/h2&gt;

&lt;p&gt;The two basic operations when using a vector-enabled database table are: &lt;strong&gt;inserting entries&lt;/strong&gt;, and &lt;strong&gt;running ANN searches&lt;/strong&gt;. Actually, there is another important ingredient that comes before any other: &lt;strong&gt;creating the table&lt;/strong&gt;, along with the required SAI index for vector search. Here is how the typical usage might unfold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creation of table and index.&lt;/strong&gt; This is where you choose a measure for the table (e.g. Euclidean).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing entries to the table&lt;/strong&gt; (i.e. rows with their vector). All vectors must have the dimensionality specified when creating the table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Querying with ANN search.&lt;/strong&gt; "Give me the &lt;strong&gt;k&lt;/strong&gt; entries whose vector is the most similar to a query vector &lt;strong&gt;V&lt;/strong&gt;."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Further writing and querying&lt;/strong&gt;, in any order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simple sequence above stresses a key fact: you commit to a measure right when you create a table. Unless you drop the index and re-create it anew, any ANN search you'll run will use that measure. In other words, &lt;strong&gt;the measure used for computing similarities is fixed at index-time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is the CQL code for a minimal creation of a vector table with the related index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE my_v_table
  (id TEXT PRIMARY KEY, my_vector VECTOR&amp;lt;FLOAT, 3&amp;gt;);
CREATE CUSTOM INDEX my_v_index ON my_v_table(my_vector)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'similarity_function': 'EUCLIDEAN'};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the sake of completeness, here is the CQL to insert a row in the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO my_v_table
  (id, my_vector)
VALUES
  ('mill_falcon', [8, 3.5, -4.2]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, this is how an ANN search is expressed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT id, my_vector
  FROM my_v_table
  ORDER BY my_vector ANN OF [5.0, -1.0, 6.5]
  LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you see, no measure is specified at query-time anymore. The CQL for vector search admits other bits and options I won't mention here (feel free to visit the &lt;a href="https://docs.datastax.com/en/astra-serverless/docs/vector-search/cql.html" rel="noopener noreferrer"&gt;docs page&lt;/a&gt;), but there is an important thing you can add. Observe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  id,
  my_vector,
  similarity_euclidean(my_vector, [5.0, -1.0, 6.5]) as sim
FROM my_v_table
  ORDER BY my_vector ANN OF [5.0, -1.0, 6.5]
  LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last statement can be phrased as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Give me the five most-similar rows to a query vector &lt;strong&gt;V&lt;/strong&gt; and, for each of them, tell me also the euclidean similarity between the row and &lt;strong&gt;V&lt;/strong&gt; itself.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The new bit is that one asks for an additional &lt;code&gt;sim&lt;/code&gt; column (the alias is strongly suggested here, to avoid unwieldy column names) containing the numeric value of the similarity between the row and the provided vector. This similarity is not stored anywhere on the DB — and how could it be, as its value depends on the vector provided in the query itself?&lt;/p&gt;
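
&lt;p&gt;&lt;em&gt;For instance, with the EUCLIDEAN index above, the value of the &lt;code&gt;sim&lt;/code&gt; column can be cross-checked client-side with a hypothetical helper implementing the similarity definition recalled earlier in this series:&lt;/em&gt;&lt;/p&gt;

```python
def expected_sim(row_vector, query_vector):
    # Euclidean similarity as defined in part 1: 1 / (1 + squared distance).
    sq_dist = sum((a - b) ** 2 for a, b in zip(row_vector, query_vector))
    return 1 / (1 + sq_dist)

# Cross-check for the row inserted earlier against the query vector:
print(expected_sim([8, 3.5, -4.2], [5.0, -1.0, 6.5]))
```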

&lt;p&gt;Now pay attention: nothing prevents one from using &lt;em&gt;two different vectors&lt;/em&gt; in the query, or even from asking for a similarity computed with &lt;em&gt;a different measure&lt;/em&gt; than the one chosen earlier for the index (hence used for searching)!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Warning: WEIRD QUERY
//   (check the vector values and the chosen measure).
// Why should you do this?
SELECT
  id,
  my_vector,
  similarity_dot_product(my_vector, [-7, 0, 11]) as weird_sim
FROM my_v_table
  ORDER BY my_vector ANN OF [5.0, -1.0, 6.5]
  LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In words: &lt;em&gt;give me the five rows closest to &lt;strong&gt;V&lt;/strong&gt;, according to the &lt;strong&gt;Euclidean&lt;/strong&gt; similarity, and compute their &lt;strong&gt;Dot-product&lt;/strong&gt; similarity to this other vector &lt;strong&gt;W&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pretty odd, no? The point is, this is something CQL does not actively prevent, but that does not mean you &lt;em&gt;should&lt;/em&gt; do it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using two different similarities at index-time and query-time is probably never a sensible choice.&lt;/li&gt;
&lt;li&gt;Using two different vectors for the ANN part and the sim column (i.e. &lt;strong&gt;W != V&lt;/strong&gt;) is something that hardly makes sense; in particular, the returned rows would not be sorted highest-to-lowest-similarity anymore.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, &lt;em&gt;the measure for the &lt;code&gt;sim&lt;/code&gt; column is specified at query-time&lt;/em&gt;. Moreover, &lt;em&gt;the query vector and the one for the &lt;code&gt;sim&lt;/code&gt; column are passed as two separate parameters&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The practical advice is then this: &lt;strong&gt;do not pass two different vectors, and do not mix similarities between index and query.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As usual, the confusion from mixing similarities is greatly reduced if working on the sphere (the ordering of results, at least, would not "look weird").&lt;/p&gt;

&lt;p&gt;Now, there &lt;em&gt;might&lt;/em&gt; be very specific use cases that could knowingly take advantage of this kind of CQL freedom. But, frankly, if you are into that kind of thing, you already know the topic covered in this section (while you're at it, why don't you drop a comment below on your use case? I'd be curious to hear about it).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: when using the Data API, as opposed to plain CQL, you retain neither the freedom to ask for a similarity other than the one configured for the table index, nor the freedom to use &lt;strong&gt;W != V&lt;/strong&gt; in vector searches. And, as was argued in this section, this is not a bad thing at all!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ww42lzq0od25nk4xm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ww42lzq0od25nk4xm0.png" alt="Questionable ways to run vector searches with Cassandra" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: Bad (or at least questionable) practices with vector searches in CQL. Left: &lt;code&gt;SELECT similarity_euclidean(my_vector, V) as sim ... ORDER BY my_vector ANN OF V&lt;/code&gt; on a table created with the Cosine similarity. The returned rows will be &lt;strong&gt;r1&lt;/strong&gt; and &lt;strong&gt;r2&lt;/strong&gt; in that order, but since (Euclidean!) &lt;strong&gt;δ1 &amp;gt; δ2&lt;/strong&gt;, then the &lt;code&gt;sim&lt;/code&gt; column will come back in increasing order. Right: &lt;code&gt;SELECT similarity_euclidean(my_vector, W) as sim ... ORDER BY my_vector ANN OF V&lt;/code&gt; on a table created with Euclidean. The closest-to-farthest sorting of the rows is referred to &lt;strong&gt;V&lt;/strong&gt;, but the similarity is calculated on &lt;strong&gt;W&lt;/strong&gt;: so, one gets back rows &lt;strong&gt;r1&lt;/strong&gt; and &lt;strong&gt;r2&lt;/strong&gt; in that order, but the &lt;code&gt;sim&lt;/code&gt; column lists increasing values again.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Null vectors and cosine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A close look at how the similarities are computed will reveal a property specific to the cosine choice: for the null vector, i.e. a vector with all components equal to zero, the definition makes no sense at all!&lt;/p&gt;

&lt;p&gt;Beneath the "technical" problem of a division by zero (&lt;strong&gt;|v|=0&lt;/strong&gt;) arising in the similarity formula lies a deeper mathematical reason: a segment with zero length has no defined direction to speak of.&lt;/p&gt;

&lt;p&gt;What does this imply? In practice, if you plan to use the cosine similarity, you must ensure that only non-null vectors are inserted. But then again, if you choose this similarity, you might as well rescale each incoming vector to unit length and live in the comfort of the sphere … which is something that can be done on all vectors &lt;em&gt;except the null vector&lt;/em&gt;! So we've come full circle.&lt;/p&gt;
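To make the division-by-zero problem concrete, here is a minimal Python sketch of the raw cosine (the helper function is illustrative only, not the database's actual implementation):

```python
import math

def raw_cosine(v, w):
    """Raw cosine of the angle between v and w; NaN when a norm is zero."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0.0 or norm_w == 0.0:
        # A zero-length vector has no direction: the cosine is undefined.
        return float("nan")
    return dot / (norm_v * norm_w)

print(raw_cosine([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (parallel vectors)
print(raw_cosine([1.0, 2.0], [0.0, 0.0]))  # nan (the null vector)
```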

&lt;p&gt;&lt;em&gt;Note that the null vector has no problem whatsoever with the Euclidean distance, nor with the Dot-product (but the latter, as remarked earlier, probably does not make much sense off the sphere).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, by now you must be curious as to how Cassandra / Astra DB handles this situation. Here are the various cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a search such as &lt;code&gt;SELECT ... ANN OF [0, 0, ...]&lt;/code&gt; raises a database error such as &lt;em&gt;Operation failed - received 0 responses and 2 failures: UNKNOWN from ...&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Inserting a row, e.g. &lt;code&gt;INSERT INTO ... VALUES (..., [0, 0, ...], ...)&lt;/code&gt; raises the same error from the DB: &lt;em&gt;Operation failed - received 0 responses and 2 failures: UNKNOWN from ...&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Using the null vector for a &lt;em&gt;similarity column&lt;/em&gt; computed during the query, such as &lt;code&gt;SELECT ... similarity_cosine(my_vector, [0, 0, ...]) ...&lt;/code&gt;, returns a &lt;code&gt;sim&lt;/code&gt; column of all &lt;strong&gt;NaN&lt;/strong&gt; (not-a-number) values. This remains true also when running queries through the Python Cassandra driver.&lt;/li&gt;
&lt;li&gt;The most interesting case is when the table &lt;em&gt;already&lt;/em&gt; contains some null-vector rows &lt;em&gt;before&lt;/em&gt; the vector SAI index is created. In that case, no errors are raised, but &lt;strong&gt;the rows will be silently left out of the index and never reached by ANN searches&lt;/strong&gt;. This can be tricky, and should be kept in mind.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: when using the Data API, the underlying behavior for Cosine-based collections is the same as outlined above – except, the last case is not really possible as the index is created as soon as the collection is available.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first three installments of this mini-series cover most of what you'll need to find your way around the mathematics, and the practical aspects, of similarity computations between vectors. However, it turns out that a few obstacles may get in your way if you want to perform something as mundane as, say, switching between vector stores in an existing GenAI application. In the next, and last, article, we’ll look at a complete example of such a migration, paying close attention to the hidden surprises along the way. See you next time!&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>vectordatabase</category>
      <category>ai</category>
    </item>
    <item>
      <title>Vector Similarities: How Similar Are They? (part 2)</title>
      <dc:creator>Stefano Lottini</dc:creator>
      <pubDate>Tue, 20 Feb 2024 15:29:18 +0000</pubDate>
      <link>https://dev.to/datastax/vector-similarities-how-similar-are-they-part-2-37g2</link>
      <guid>https://dev.to/datastax/vector-similarities-how-similar-are-they-part-2-37g2</guid>
      <description>&lt;p&gt;In a &lt;a href="https://dev.to/datastax/what-we-learned-when-we-built-a-vector-database-and-our-customers-started-using-it-part-1-1c6h"&gt;previous post&lt;/a&gt;, we discussed the ins and outs of vectors and how to interact with them. &lt;/p&gt;

&lt;p&gt;Here, I’ll provide a more detailed account of vector similarities and how they behave. Let's start with the Euclidean and the Cosine similarities.&lt;/p&gt;

&lt;p&gt;Let’s say you want to measure how close two points (two vectors) are. How do you proceed? If your plan is to place a ruler between them and, well, measure the distance, then congratulations: you just invented what is called "Euclidean" (or &lt;strong&gt;L2&lt;/strong&gt;) distance. Let's call it &lt;strong&gt;δeucl&lt;/strong&gt;, a number ranging from zero to infinity (you have encountered it already in the introductory table above).&lt;/p&gt;

&lt;p&gt;The next step is to make this quantity into a similarity proper: the choice made in Apache Cassandra and in &lt;a href="https://www.datastax.com/products/datastax-astra?utm_source=dev-to&amp;amp;utm_medium=byline&amp;amp;utm_campaign=vectors&amp;amp;utm_term=all-plays&amp;amp;utm_content=what-we-learned" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, &lt;strong&gt;Seucl = 1/(1+δeucl&lt;sup&gt;2&lt;/sup&gt;)&lt;/strong&gt;, guarantees a quantity in the zero-to-one range and is also computationally not too expensive. Moreover, one gets &lt;strong&gt;Seucl=1&lt;/strong&gt; for two &lt;em&gt;identical&lt;/em&gt; vectors, while very, very far-apart vectors would have a similarity approaching zero. So much for the Euclidean similarity.&lt;/p&gt;
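As an illustration, the formula can be sketched in a few lines of Python (the function name is mine, not part of any driver API):

```python
def euclidean_similarity(v, w):
    """Euclidean similarity as defined above: S_eucl = 1 / (1 + d^2),
    where d is the plain Euclidean (L2) distance between v and w."""
    d_squared = sum((a - b) ** 2 for a, b in zip(v, w))
    return 1.0 / (1.0 + d_squared)

# Identical vectors score exactly 1...
print(euclidean_similarity([8.0, 3.5, -4.2], [8.0, 3.5, -4.2]))  # 1.0
# ...and the value decays toward zero as the vectors drift apart.
print(euclidean_similarity([0.0, 0.0], [3.0, 4.0]))  # 1 / (1 + 25)
```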

&lt;p&gt;Of the (few) other choices, the Cosine similarity is the most common (and is available in Cassandra / Astra DB). The key aspect is that it discards the &lt;em&gt;length&lt;/em&gt; part of the vectors and just looks at how well their &lt;em&gt;directions&lt;/em&gt; are aligned to each other. Consider the angle formed between the two vectors: when this angle is zero, you have two arrows pointing exactly in the same direction, i.e. two vectors with similarity of one. As the angle increases, the arrows are less and less aligned, and their cosine similarity decreases, down to the extreme case of two perfectly opposed vectors, which corresponds to similarity zero. It takes a moment of thinking to notice that this definition, possibly a bit weird when the vectors may have different norms (i.e. lengths), "feels correct" when all vectors are of unit length.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Important note:&lt;/strong&gt; rescaling the similarities so that they range between zero and one is a somewhat arbitrary, but practically useful, choice. Elsewhere, you may see a slightly different definition for the cosine similarity, &lt;strong&gt;S*cos&lt;/strong&gt;, chosen instead to range from -1 to +1. There are pros and cons in both choices of &lt;strong&gt;S*cos&lt;/strong&gt; and &lt;strong&gt;Scos&lt;/strong&gt;, and a good suggestion is to always be aware of the specifics of the similarity function your system is using (for more on the subtleties of the scale of similarities, check the last sections of this writeup).&lt;/em&gt;&lt;/p&gt;
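Here is the Cosine counterpart as a Python sketch. Note that the exact rescaling from the raw -1..+1 cosine to the zero-to-one range, taken here to be (1 + cos)/2, is an assumption of mine, chosen only to match the range just described:

```python
import math

def cosine_similarity(v, w):
    """Zero-to-one Cosine similarity, assumed here to be (1 + cos(angle)) / 2.
    Vector lengths are discarded: only the directions matter."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return (1.0 + dot / (norm_v * norm_w)) / 2.0

print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))   # aligned: 1.0 (length ignored)
print(cosine_similarity([1.0, 0.0], [-2.0, 0.0]))  # opposed: 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 5.0]))   # perpendicular: 0.5
```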

&lt;h2&gt;
  
  
  The comfort of the sphere
&lt;/h2&gt;

&lt;p&gt;There should be a saying along the lines of, "On the sphere, life is easy."&lt;/p&gt;

&lt;p&gt;The essence of vector search is as follows: given a "query vector," one wants to find, among those stored in the database, the vectors that are most similar to it, sorted by decreasing similarity. Of course, this is based on the similarity measure you choose to adopt! So, how do the results change when different similarities are used?&lt;/p&gt;

&lt;p&gt;You have just seen the Cosine and the Euclidean similarities. There is no doubt they do different things. But here is an important fact: &lt;em&gt;on the sphere, no matter what similarity you use, you get the same top results in the same order.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason is perhaps best given with a picture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ofmbtn5qr4dol78sf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ofmbtn5qr4dol78sf2.png" alt="Image description" width="495" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: On the unit sphere, no matter whether using the Euclidean or Cosine similarity, &lt;strong&gt;B&lt;/strong&gt; is more similar to &lt;strong&gt;A&lt;/strong&gt; than to &lt;strong&gt;C&lt;/strong&gt;. In other words, smaller angles subtend shorter segments: &lt;strong&gt;angle(AB) &amp;lt; angle(AC) iff AB &amp;lt; AC&lt;/strong&gt;. (Related: it so happens that most sentence embedding vectors are indeed of unit length, and when they are not, you are probably better off by normalizing the vector to unit length yourself: &lt;strong&gt;v ⇒ v/|v|&lt;/strong&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here, then, is the main takeaway for vectors on a sphere: regardless of whether you use the Euclidean or the Cosine similarity, a vector search returns the same top matches, arranged in the same order. The only difference is in the exact &lt;em&gt;values&lt;/em&gt; for the similarities. But this is a difference fully under our control: you see, on the sphere it can be proven that there is a mathematically precise relation between the similarities for a given pair of vectors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhe100he307tiyuhteov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhe100he307tiyuhteov.png" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Caption: On the unit sphere, and only in that case, the Cosine and Euclidean similarities can be related to each other directly, without the need for additional information on the vectors. The relations, pictured here, are as follows: &lt;strong&gt;Seucl = 1/(5-4Scos)&lt;/strong&gt; and conversely &lt;strong&gt;Scos = (5Seucl-1)/(4Seucl)&lt;/strong&gt;. It is important to notice that these are increasing functions, implying that switching between measures does not alter the "more similar than…" relation. Note also that, on the sphere, the Euclidean similarity has its minimal value at 0.2 (corresponding to the fact that two points on the unit sphere cannot have a distance of more than &lt;strong&gt;δeucl,max = 2&lt;/strong&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The fact that the transformations between &lt;strong&gt;Seucl&lt;/strong&gt; and &lt;strong&gt;Scos&lt;/strong&gt; are strictly increasing functions is at the heart of the "indifference to the measure used" mentioned above. What about thresholds, for instance if your Cosine-powered search needs to cut away vectors with similarity below a certain &lt;strong&gt;Scos&lt;sup&gt;min&lt;/sup&gt;&lt;/strong&gt;? You can translate thresholds as you'd do similarities, and run the search with the Euclidean measure provided you translate cuts to &lt;strong&gt;Seucl&lt;sup&gt;min&lt;/sup&gt; = 1/(5-4Scos&lt;sup&gt;min&lt;/sup&gt;)&lt;/strong&gt;.&lt;/p&gt;
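These relations are easy to check numerically. Below is a quick Python verification on random unit vectors; the (1 + cos)/2 form used for the zero-to-one Cosine similarity is my assumption, consistent with the relations quoted in the caption:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def s_eucl(v, w):
    # Euclidean similarity: 1 / (1 + squared distance)
    return 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(v, w)))

def s_cos(v, w):
    # Zero-to-one cosine, assumed to be (1 + cos)/2
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return (1.0 + dot / (nv * nw)) / 2.0

random.seed(0)
for _ in range(1000):
    v = normalize([random.gauss(0, 1) for _ in range(5)])
    w = normalize([random.gauss(0, 1) for _ in range(5)])
    # On the unit sphere: S_eucl == 1 / (5 - 4 * S_cos)
    assert abs(s_eucl(v, w) - 1.0 / (5.0 - 4.0 * s_cos(v, w))) < 1e-12
print("relation verified on 1000 random unit-vector pairs")
```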

&lt;p&gt;So here is another takeaway. When you are on a sphere, there is no substantial difference between the Cosine and Euclidean approaches – you should just pick the less CPU-intensive solution to reap a performance boost. And, as it turns out, the least CPU cycles are spent with ... the Dot-product measure, which you'll examine next!&lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's not much to learn by "experimenting with different indexes" on a sphere&lt;/li&gt;
&lt;li&gt;There's nothing deep to be found in "testing different similarities" on a sphere&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dot-product and the sphere
&lt;/h2&gt;

&lt;p&gt;As we anticipated, Dot-product is the odd one of the bunch. But, to one's relief, on the unit sphere, &lt;em&gt;and only there&lt;/em&gt;, its results coincide exactly with the Cosine similarity.&lt;br&gt;
The reason becomes clear by looking at the definitions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mbu698boeypgk7m3eep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mbu698boeypgk7m3eep.png" alt="Image description" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleyb48gk46iqbrdgjwpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleyb48gk46iqbrdgjwpo.png" alt="Image description" width="597" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One quickly notices that on the sphere the norms can indeed be omitted, since in that case we have &lt;strong&gt;|v|=1&lt;/strong&gt; for all vectors &lt;strong&gt;v&lt;/strong&gt;: it follows that &lt;strong&gt;Scos = Sdot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can then treat the Dot-product similarity as a fully valid substitute for the Cosine similarity, with the advantage that the former requires the least amount of computation. In short, &lt;strong&gt;if you can work on the sphere&lt;/strong&gt;, i.e. if you can ensure all vectors have unit norm, &lt;strong&gt;there is no real reason to use anything other than Dot-product&lt;/strong&gt; in your vector search tables.&lt;/p&gt;
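A quick numerical sanity check of this identity, in plain Python (helper names are my own):

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def raw_cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw)

def raw_dot(v, w):
    return sum(a * b for a, b in zip(v, w))

random.seed(42)
v = normalize([random.gauss(0, 1) for _ in range(4)])
w = normalize([random.gauss(0, 1) for _ in range(4)])
# With |v| = |w| = 1 the norms in the cosine formula drop out,
# so the raw cosine and the raw dot product coincide.
assert abs(raw_cosine(v, w) - raw_dot(v, w)) < 1e-12
print("on the unit sphere, cosine and dot product agree")
```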

&lt;h2&gt;
  
  
  Leaving the sphere (i.e. arbitrary vectors)
&lt;/h2&gt;

&lt;p&gt;One lesson so far is that you probably should work on the sphere when it makes sense to do so. If you are working with embedding vectors (whether computed for texts, images or anything really), this is probably the case. Note that you may need to normalize the vectors yourself, since some embedding models may output arbitrary-norm vectors.&lt;/p&gt;

&lt;p&gt;There are situations, however, in which the length of vectors also carries important information: an example may be two-dimensional vectors expressing positions on a map for a geo-search-based application. Typically, in these cases you should use the Euclidean measure.&lt;/p&gt;

&lt;p&gt;When working with arbitrary vectors, there are real differences between the Cosine and the Euclidean measures: in particular, you cannot just "translate similarities to similarities" without knowing the values of the two vectors themselves.&lt;/p&gt;

&lt;p&gt;In other words, you will get &lt;em&gt;different&lt;/em&gt; top results when running the same query on the same dataset depending on the measure employed. Let's capture the reason with a drawing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9yz7zaeo11idnayuxrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9yz7zaeo11idnayuxrd.png" alt="Image description" width="559" height="592"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Caption: Unless your vectors are all of unit length, Euclidean and Cosine similarities tell a different story. Which is the vector closest to &lt;strong&gt;A&lt;/strong&gt;: &lt;strong&gt;B&lt;/strong&gt; or &lt;strong&gt;C&lt;/strong&gt;? (Solution: according to the Euclidean measure, &lt;strong&gt;C&lt;/strong&gt; is most similar to &lt;strong&gt;A&lt;/strong&gt;, while the Cosine similarity has &lt;strong&gt;B&lt;/strong&gt; ranking first.)&lt;/em&gt;&lt;/p&gt;
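A minimal Python illustration of this disagreement, using made-up vectors rather than the ones in the drawing:

```python
import math

def euclid_dist(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def raw_cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw)

# Hypothetical off-sphere vectors, chosen so the two measures disagree:
A = [1.0, 0.0]
B = [5.0, 0.0]   # perfectly aligned with A, but far away
C = [0.9, 0.5]   # close to A, but at an angle

print(euclid_dist(A, B), euclid_dist(A, C))  # by distance, C wins
print(raw_cosine(A, B), raw_cosine(A, C))    # by angle, B wins
```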

&lt;p&gt;Now, with arbitrary vectors, the question whether to use Euclidean or Cosine similarity is more substantial. But, as anticipated earlier, if you choose Cosine you might as well have normalized your vectors at ingestion time to have them on a sphere ... and have in practice used Dot-product! Otherwise, yours is one of those cases for which only the Euclidean choice makes sense: for example, your vectors are the three-dimensional coordinates of inhabited planets and your app helps galactic hitchhikers locate the civilizations closest to them.&lt;/p&gt;

&lt;h2&gt;
  
  
  That pesky Dot-product
&lt;/h2&gt;

&lt;p&gt;If you are not on the sphere, the Dot-product similarity is not just a "faster Cosine." In this case, the recommendation is to avoid it altogether.&lt;/p&gt;

&lt;p&gt;The fundamental reason that makes Dot-product stand apart is that this measure, as a candidate way to assess distances, fails our intuition. (This reasoning requires us to speak of "distances," contrary to our earlier resolution – don't worry, we can keep this a very short and tangential digression.)&lt;/p&gt;

&lt;p&gt;Here is something we expect from any notion of "distance":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if &lt;strong&gt;A&lt;/strong&gt; is very close to &lt;strong&gt;B&lt;/strong&gt;, and &lt;strong&gt;B&lt;/strong&gt; is very far from &lt;strong&gt;C&lt;/strong&gt;, then &lt;strong&gt;A&lt;/strong&gt; is very far from &lt;strong&gt;C&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It turns out that the Dot-product admits counterexamples that falsify the above intuition (which can be shown to be a consequence of a stricter requirement called "triangle inequality," when taking a particular limiting case). One such counterexample is &lt;strong&gt;A = [1, 0]; B = [1, -1/M]; C = [1, M]&lt;/strong&gt; for sufficiently large values of &lt;strong&gt;M&lt;/strong&gt;.&lt;/p&gt;
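The counterexample is easy to check in Python, reading a larger raw dot product as "more similar" (here with M = 1024):

```python
def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

M = 1024.0
A = [1.0, 0.0]
B = [1.0, -1.0 / M]
C = [1.0, M]

print(dot(A, B))  # 1.0 -> A and B look very close
print(dot(B, C))  # 0.0 -> B and C look very far apart
print(dot(A, C))  # 1.0 -> ...yet A and C look close again!
```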

&lt;p&gt;You can get an intuitive picture of this problem as follows: the Dot-product measure looks at something like "shadows of the vectors as cast with the light source (the query vector) placed at a specific position". Two vectors, whose shadows almost overlap on one wall, might turn out to be in fact not close at all if the light source is moved elsewhere (i.e. searching with a different query vector) and the shadow is cast on another wall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1mmnfd48q628ttavbhx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1mmnfd48q628ttavbhx.png" alt="Image description" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: With the Dot-product measure, the notion of "similar vectors" depends on what query vector is the comparison anchored to. In the above depiction, the query vector is the light source, which projects objects to one wall or another – with dramatic effects on whether they appear to be close to each other or not. The concept of "being close to each other", therefore, is defined not just by the two items being compared (the rabbit and the pear in the cartoon).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfxzarjdg9c3fjm2oail.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfxzarjdg9c3fjm2oail.png" alt="Image description" width="727" height="539"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Caption: The quirks of Dot-product, in a more formal depiction, otherwise analogous to the "room with a pear and a rabbit" cartoon. If you take the Dot product with the &lt;strong&gt;x&lt;/strong&gt; axis, i.e. the &lt;strong&gt;[1, 0]&lt;/strong&gt; axis, the two vectors turn out to be quite dissimilar – while if you refer them to the &lt;strong&gt;[0, 1]&lt;/strong&gt; query vector they'll look very similar. The notion of "similar vectors" is not an intrinsic property of the vector pair itself.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This oddball character of the Dot-product has three important consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, although its expected usage is as a "faster replacement for Cosine on the sphere", once you step outside the sphere your Dot-product similarities are no longer confined to the zero-to-one interval. This is a consequence of its mathematical definition: 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77c9ki8jlhnbmi1uqxgc.png" alt="Image description" width="597" height="107"&gt; &lt;/li&gt;
&lt;li&gt;Second, there is generally no optimization effort aimed at the off-sphere usage of Dot-product. Should you insist on using this measure for arbitrary vectors (and it turns out &lt;a href="https://github.com/hemidactylus/astra_misc_resources/blob/main/dot-product-for-weighted-ranking/Dot-product-for-weigthed-1.ipynb" rel="noopener noreferrer"&gt;there are cases&lt;/a&gt; where this might have an appeal, if you know what you are doing), the performance implications may be somewhat of an uncharted territory.&lt;/li&gt;
&lt;li&gt;Third, to further elaborate on the previous point, the indexes that make vector search so fast rely on precomputed data structures, stored in the index for use by all queries, that provide information on "what is close to what." But here one has a case where closeness changes drastically depending on the query vector being used! So, the effectiveness of such a precomputed structure is greatly hampered, and each vector-search query needs much more effort to be executed satisfactorily.&lt;/li&gt;
&lt;/ul&gt;
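To see the first point concretely, here is a Python sketch; the (1 + v·w)/2 rescaling used for the zero-to-one form of the Dot-product similarity is my assumption, matching the unit-sphere behavior described earlier:

```python
def s_dot(v, w):
    # Zero-to-one rescaling of the raw dot product: (1 + v.w) / 2.
    # (This exact form is an assumption, patterned on the unit-sphere case.)
    return (1.0 + sum(a * b for a, b in zip(v, w))) / 2.0

# On the unit sphere the value stays within [0, 1]:
print(s_dot([1.0, 0.0], [0.0, 1.0]))   # 0.5
# Off the sphere, nothing keeps it in range anymore:
print(s_dot([3.0, 0.0], [3.0, 0.0]))   # (1 + 9) / 2 = 5.0
print(s_dot([3.0, 0.0], [-3.0, 0.0]))  # (1 - 9) / 2 = -4.0
```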

&lt;h2&gt;
  
  
  Summary table
&lt;/h2&gt;

&lt;p&gt;As we're about to change subject, let's wrap up what we've seen so far:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;measure&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;domain&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;notes&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Euclidean&lt;/td&gt;
&lt;td&gt;sphere (unit-norm vectors)&lt;/td&gt;
&lt;td&gt;Consider switching to Cosine or Dot-product (just the numeric similarities change, ordering unchanged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cosine&lt;/td&gt;
&lt;td&gt;sphere (unit-norm vectors)&lt;/td&gt;
&lt;td&gt;If you know you're on a sphere, switch to Dot-product (the only change is faster computation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dot-product&lt;/td&gt;
&lt;td&gt;sphere (unit-norm vectors)&lt;/td&gt;
&lt;td&gt;The best-performance measure on the sphere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Euclidean&lt;/td&gt;
&lt;td&gt;arbitrary vectors&lt;/td&gt;
&lt;td&gt;Use this if every aspect of the vector (incl. the norm) carries useful information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cosine&lt;/td&gt;
&lt;td&gt;arbitrary vectors&lt;/td&gt;
&lt;td&gt;Consider normalizing vectors to unit norm at ingestion time and, once on the sphere, switching to Dot-product&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dot-product&lt;/td&gt;
&lt;td&gt;arbitrary vectors&lt;/td&gt;
&lt;td&gt;Probably not what you want (ensure there's a strong reason to fall in this case); performance may be tricky&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The upcoming installment of this series is devoted to the specific choices made in Astra DB and Cassandra with regard to similarities: how to create tables, how to perform queries, what are the associated best practices and pitfalls. Keep reading to find out more – see you next time!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>cosine</category>
    </item>
    <item>
      <title>What We Learned When We Built a Vector Database-and Our Customers Started Using It (part 1)</title>
      <dc:creator>Stefano Lottini</dc:creator>
      <pubDate>Thu, 15 Feb 2024 00:55:30 +0000</pubDate>
      <link>https://dev.to/datastax/what-we-learned-when-we-built-a-vector-database-and-our-customers-started-using-it-part-1-1c6h</link>
      <guid>https://dev.to/datastax/what-we-learned-when-we-built-a-vector-database-and-our-customers-started-using-it-part-1-1c6h</guid>
      <description>&lt;p&gt;Everyone is talking about vectors these days. Cosines, ANN searches, normalizations, sentence embeddings—there’s so much to know, it can feel a bit overwhelming at times!&lt;/p&gt;

&lt;p&gt;In the last half year or so, the engineering team at DataStax has been busy delivering a performant, efficient, and scalable vector database experience. Our work spanned many areas, including – crucially – offering guidance to customers and helping them figure out how to get the best out of their vector-based applications.&lt;/p&gt;

&lt;p&gt;On this journey (which, as the saying goes, is far from over), we noticed recurring pitfalls, dead ends, and, generally speaking, "things one should've known earlier," on topics ranging from the basics all the way to weird corner cases one does not often think about. &lt;/p&gt;

&lt;p&gt;This four-part series of articles is an attempt to collect some of these findings for the benefit of the reader and their future vector-based endeavors. Although these posts are mostly about "mathematical" properties of vectors (which are valid irrespective of the backend you’re using), a few of the remarks will be specific to Apache Cassandra&lt;sup&gt;®&lt;/sup&gt;'s and DataStax Astra DB's stance on vectors.&lt;/p&gt;

&lt;p&gt;This first article covers general properties of vectors and tips to interact with them, independent of their origin and purpose. The second article will examine the various notions of vector similarity and their properties. In article three, we’ll take a closer look at the usage of Cassandra and Astra DB with vectors. Finally, we’ll offer a couple of quirks and corner cases to keep in mind in your Cassandra- or Astra DB-enabled vector applications and a small example of a full "migration between vector stores," with an emphasis on the pitfalls with distances and similarities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The basics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a vector?&lt;/strong&gt;&lt;br&gt;
A vector is a quantity used in geometry and most sciences to denote a phenomenon with a direction and a length (the mathematically rigorous readers, I am sure, will pardon this very practical definition). To describe the blowing of the wind at a given point, for instance, you might use a vector (the vector's "length" would then be the wind intensity). With a certain choice of a reference basis (e.g. the x, y and z axes for three-dimensional vectors), a vector can be written as a list of numbers (its three x, y, z components), with as many numbers as its &lt;strong&gt;dimensionality&lt;/strong&gt;. While typical vectors one can imagine in the "real world around us" are 3-dimensional or less, nothing prevents you from thinking of 10-dimensional, 768-dimensional, 1536-dimensional vectors (… or even infinite-dimensional vectors. But I digress).&lt;/p&gt;

&lt;p&gt;A vector, in short, is a way to denote a point belonging to a given space. You can picture a vector as an arrow whose tip is the denoted point. It is an expectation from vector geometry that the "length" (or norm) of a vector be defined regardless of which direction it is oriented in; i.e., a vector space implies some meaningful notion of "rotation" for its vectors.&lt;/p&gt;

&lt;p&gt;Don’t confuse the length with the count of the components (i.e. the dimension, which is always a positive integer).&lt;br&gt;
Take &lt;em&gt;all&lt;/em&gt; possible 2-dimensional &lt;strong&gt;(x, y)&lt;/strong&gt; vectors: the points they describe form the whole of a flat plane! Now take only the vectors of length equal to one ("unit vectors"): their points form a circle (of radius one). Likewise, the unit-length vectors with dimension three form a sphere in space.&lt;/p&gt;

&lt;p&gt;Below, I’ll speak of "spheres" in a broad sense, regardless of the dimension (circles, spheres and "hyperspheres" alike): so, when I say "vectors on a sphere," I actually just mean "vectors of some dimensionality whose length is equal to one."&lt;/p&gt;
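&lt;p&gt;To make this concrete, here is a minimal Python sketch (plain standard library, no vector framework; the helper names are my own, not from any particular tool) that computes a vector's length and rescales it onto the unit sphere:&lt;/p&gt;

```python
import math

def norm(v):
    # Length of a vector: square root of the sum of its squared components.
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    # Rescale a (nonzero) vector so that it lies on the unit sphere.
    n = norm(v)
    return [x / n for x in v]

v = [3.0, 4.0]           # a 2-dimensional vector (d=2)
print(norm(v))           # 5.0
u = normalize(v)
print(u)                 # [0.6, 0.8]
print(norm(u))           # 1.0 -- u is now "on the sphere"
```

&lt;p&gt;Note how the dimension of &lt;code&gt;v&lt;/code&gt; (two components) is unchanged by normalization; only its length is.&lt;/p&gt;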

&lt;p&gt;&lt;strong&gt;Of similar and dissimilar vectors&lt;/strong&gt;&lt;br&gt;
Comparing numbers is easy: 10 is similar to 10.3 and very different from, say, 9979. Just look at (the absolute value of) their difference! What about vectors: how can you say whether two vectors are "similar", or "close", to each other?&lt;br&gt;
It turns out that with vectors there are several ways ("measures") to decide what "close to each other" might mean. &lt;/p&gt;

&lt;p&gt;The subtle differences between the available definitions are enough to warrant a deeper investigation... and a dedicated section later in this post!&lt;/p&gt;

&lt;p&gt;Even though these definitions differ in their mathematical formulation, they are all ways to quantify, with a number, the degree of similarity between vectors. Conceptually, higher similarity means the vectors are closer to each other, or – equivalently – their distance is &lt;em&gt;lower&lt;/em&gt;. This fact holds regardless of the "measure" being adopted.&lt;/p&gt;

&lt;p&gt;Here is a brief overview of the measures I will consider (which happen to be the choices available on Cassandra and Astra DB). Committing to one measure or another depends largely on the nature of the data you want to represent as vectors, and how you do it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Euclidean similarity: "how close to each other the tips of the two arrows are"&lt;/li&gt;
&lt;li&gt;Cosine similarity: "by how much are the two arrows pointing to the same &lt;em&gt;direction&lt;/em&gt; (regardless of the vectors' lengths)"&lt;/li&gt;
&lt;li&gt;Dot-product similarity: this is the oddball of the bunch. You'll see more about "Dot" later on.&lt;/li&gt;
&lt;/ul&gt;
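&lt;p&gt;As a rough illustration, here is how the raw quantity behind each measure can be computed in plain Python. This is a sketch of the textbook definitions (function names are my own), not any specific database's implementation:&lt;/p&gt;

```python
import math

def dot_product(v1, v2):
    # Sum of component-wise products; also the building block of the others.
    return sum(a * b for a, b in zip(v1, v2))

def euclidean_distance(v1, v2):
    # "How close to each other the tips of the two arrows are."
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def cosine(v1, v2):
    # Cosine of the angle between the arrows, insensitive to their lengths.
    n1 = math.sqrt(dot_product(v1, v1))
    n2 = math.sqrt(dot_product(v2, v2))
    return dot_product(v1, v2) / (n1 * n2)

a, b = [1.0, 0.0], [1.0, 1.0]
print(euclidean_distance(a, b))  # 1.0
print(dot_product(a, b))         # 1.0
print(cosine(a, b))              # ~0.7071, i.e. cos(45 degrees)
```

&lt;p&gt;Note that &lt;code&gt;cosine&lt;/code&gt; returns the same value if either input is rescaled, while the other two quantities change: this is precisely the "regardless of the vectors' lengths" remark above.&lt;/p&gt;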

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F114wgkjfylbkexw7lypw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F114wgkjfylbkexw7lypw.png" alt="Image description" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: Vectors in two dimensions, on the whole plane (left) and limited to a circle of radius one (a "two-dimensional sphere"). Vectors can be thought of as lists of numbers, but they are usually represented as arrows anchored at an origin point. The "angular distance" on the right is related to the Cosine similarity between two vectors: smaller angle means higher similarity. Likewise, vectors with a smaller Euclidean distance (pictured) are more similar.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are a few other definitions of vector similarity available; most are variations of the Euclidean family, but you need not be concerned with them here.&lt;/p&gt;

&lt;p&gt;Note that generally one prefers to think in terms of "similarity" rather than "distance," as the former lends itself to fewer mathematical complications and more immediate practical applicability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's get the terms right
&lt;/h2&gt;

&lt;p&gt;In a sense, this article is a mathematical essay in disguise. That means that even if I do my best to keep the number of formulae to a minimum, I still need to use the right concepts precisely and with rigor. For this reason, let's start with a few definitions.&lt;/p&gt;

&lt;p&gt;For this series of articles, I will consistently use the following terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimension&lt;/strong&gt; or &lt;strong&gt;dimensionality&lt;/strong&gt;: denoted by &lt;strong&gt;d&lt;/strong&gt;, this amounts to how many numbers (&lt;em&gt;components&lt;/em&gt;) form the vector, or equivalently the number of independent directions in the vector space. For example, vectors denoting positions on a sheet of paper have &lt;strong&gt;d=2&lt;/strong&gt;, while for locations in the space around you vectors with &lt;strong&gt;d=3&lt;/strong&gt; are needed. &lt;strong&gt;d&lt;/strong&gt; is much higher than that for most AI-related vector applications. Comparing vectors with different dimensions just makes no sense.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Length&lt;/strong&gt; or &lt;strong&gt;norm&lt;/strong&gt;: for a vector &lt;strong&gt;v&lt;/strong&gt;, this is denoted by &lt;strong&gt;|v|&lt;/strong&gt;. It is the length of the arrow from tip to tail. One can calculate it as the square root of the sum of the squares of all the components:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoz4c3u83elnbycu26za.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoz4c3u83elnbycu26za.png" alt="Image description" width="273" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure&lt;/strong&gt;: a mental model of "what it means for vectors to be close/distant." One can consider a Cosine measure, a Euclidean measure, and so on. Choosing a measure does not mean committing to a precise formula just yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As mentioned above, I prefer using the notion of "similarity" over that of "distance" where possible. Cassandra and Astra DB never expose anything that is a "distance," only similarities; moreover, not all measures offer a natural and simple-to-understand "distance" to think about.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Similarity&lt;/strong&gt;: a numeric way to quantify how close two vectors &lt;strong&gt;v1&lt;/strong&gt; and &lt;strong&gt;v2&lt;/strong&gt; are to each other, computed with some formula &lt;strong&gt;S(v1, v2)&lt;/strong&gt;. One expects that &lt;strong&gt;S(v1, v2) = S(v2, v1)&lt;/strong&gt;; one also requires this quantity to be higher for pairs of vectors that are more similar to each other. It is desirable (and often verified) that this quantity be bound within a known range: for Cassandra and Astra DB, similarities are chosen so as to lie between zero (most dissimilar, or very high-distance, vectors) and one (most similar). As you will see, though, there is a notable exception to this general rule!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unit sphere&lt;/strong&gt;: The set of all vectors with unit length (all the vectors &lt;strong&gt;v&lt;/strong&gt; for which &lt;strong&gt;|v| = 1&lt;/strong&gt;). When I say that vectors are "on a sphere", it is implied that they are on the unit sphere. This special (and very common) case unlocks a few nice properties.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the mathematically inclined, here are the relevant formulae behind the different similarities. In these, &lt;strong&gt;xi&lt;/strong&gt; with &lt;strong&gt;i=1,... d&lt;/strong&gt; denote the components of vector &lt;strong&gt;x&lt;/strong&gt; (i.e. each one of the numbers in the list representing &lt;strong&gt;x&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg20l82jo0fqvl5irnx0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg20l82jo0fqvl5irnx0t.png" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;
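&lt;p&gt;A Python rendering of these rescaled similarities may help. To the best of my reading of the Cassandra documentation, the raw quantities are mapped into the zero-to-one range as &lt;strong&gt;1 / (1 + d&lt;sup&gt;2&lt;/sup&gt;)&lt;/strong&gt; for Euclidean and &lt;strong&gt;(1 + value) / 2&lt;/strong&gt; for Cosine and Dot-product; treat this sketch as an approximation of the formulae pictured above, not a drop-in replacement for the database's own computation:&lt;/p&gt;

```python
import math

def euclidean_similarity(v1, v2):
    # 1 / (1 + d^2): equals one for coincident vectors, approaches zero
    # as the arrow tips get farther apart.
    d2 = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return 1.0 / (1.0 + d2)

def cosine_similarity(v1, v2):
    # (1 + cos) / 2: maps the cosine from [-1, +1] into [0, 1].
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return (1.0 + dot / (n1 * n2)) / 2.0

def dot_similarity(v1, v2):
    # (1 + dot) / 2: bounded within [0, 1] only for vectors on the
    # unit sphere -- the "notable exception" mentioned above.
    dot = sum(a * b for a, b in zip(v1, v2))
    return (1.0 + dot) / 2.0

v = [0.6, 0.8]                              # a unit vector
print(euclidean_similarity(v, v))           # 1.0 (identical vectors)
print(cosine_similarity(v, [-0.6, -0.8]))   # 0.0 (opposite directions)
print(dot_similarity(v, v))                 # 1.0 (on the sphere)
```

&lt;p&gt;On the unit sphere, Cosine and Dot-product similarity coincide, a fact that will come in handy later in this series.&lt;/p&gt;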

&lt;h2&gt;
  
  
  Not every list of numbers is a vector
&lt;/h2&gt;

&lt;p&gt;An implied assumption when thinking of vectors is that rotations in their space should "make sense" (pardon the non-rigorous parlance—you're not reading a linear algebra textbook). This essentially amounts to their components being "of the same kind", and is critical for a proper interpretation of the "similarity" between vectors (whatever its precise definition).&lt;/p&gt;

&lt;p&gt;Suppose you associate a list of numbers such as &lt;code&gt;[price, average_rating]&lt;/code&gt; to searchable ecommerce items. These are heterogeneous quantities. As a consequence, it is not well defined what we mean by "two items having distance 5 from each other": the lack of a natural way to rotate vectors in this space invalidates the notion of similarity as well. What a downer!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F201rc739dw9xvs2m1r20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F201rc739dw9xvs2m1r20.png" alt="Image description" width="738" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: Which item is "closer" to the mousepad: the t-shirt or the paperweight? Certainly geometry alone cannot easily answer such a question. In other words, what we are saying is that the problem of comparing a distance of $5 on the money axis to a distance of 5 on the rating axis is&lt;/em&gt; not just &lt;em&gt;a geometry problem.&lt;/em&gt;&lt;/p&gt;
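&lt;p&gt;A quick numerical illustration of the trouble (with hypothetical items and made-up numbers of my own): merely changing the unit on the price axis, from dollars to cents, flips which item is geometrically "closest" to the mousepad.&lt;/p&gt;

```python
import math

def distance(v1, v2):
    # Plain Euclidean distance between two "pseudo-vectors".
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# [price_in_dollars, average_rating] -- heterogeneous components.
mousepad    = [10.0, 4.0]
tshirt      = [11.0, 0.5]
paperweight = [13.0, 4.0]

# In dollars, the rating axis dominates: the paperweight is "closer".
print(distance(mousepad, tshirt))       # ~3.64
print(distance(mousepad, paperweight))  # 3.0

# Same items with prices in cents: now the price axis dominates,
# and the t-shirt becomes "closer". Geometry alone can't settle this.
mousepad_c, tshirt_c, paperweight_c = [1000.0, 4.0], [1100.0, 0.5], [1300.0, 4.0]
print(distance(mousepad_c, tshirt_c))       # ~100.06
print(distance(mousepad_c, paperweight_c))  # 300.0
```

&lt;p&gt;The ranking of "nearest items" depends entirely on an arbitrary choice of units, which is exactly the sign that these lists of numbers are not really vectors.&lt;/p&gt;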

&lt;p&gt;A practical rule then is the following: applying vector search methods is a sensible approach, generally, if and only if the dimensions are of the same kind – in other words, if one can think of summing the numbers in the list in a meaningful way. &lt;/p&gt;

&lt;p&gt;This is the case, for instance, for a vector expressing a position on a map as &lt;code&gt;[meters_north_of_home, meters_west_of_home]&lt;/code&gt;, but also – crucially – for the sentence embedding vectors that are being used so fruitfully in the GenAI world nowadays.&lt;/p&gt;

&lt;p&gt;There are, indeed, advanced, "non-conventional" cases where one might concoct and use such a pseudo-vector – &lt;a href="https://github.com/hemidactylus/astra_misc_resources/blob/main/dot-product-for-weighted-ranking/Dot-product-for-weigthed-1.ipynb" rel="noopener noreferrer"&gt;here's an example&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Coming up next, we’ll explore one of the most important choices one must face when designing a vector-powered application: which &lt;em&gt;similarity measure&lt;/em&gt; should be adopted? We briefly covered the Euclidean, Cosine and Dot-product options (their virtues and their differences), but are these similarities so dissimilar after all? Stay tuned to find out – see you next time!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>vector</category>
      <category>database</category>
    </item>
  </channel>
</rss>
