I initially thought integrating vector databases would be as simple as learning any other Python library, assuming I could skip complex configurations until later. I was wrong. Tools like Milvus required an immediate deep dive into specific settings, such as metric types, before I could even successfully interact with the database.
The Metric Type.
This isn't an arbitrary math preference you can pick on a whim; it defines what "similarity" means for the data in your vector database. Mathematically, it determines how the distance between two chunks is measured in vector space: by the straight-line distance between them, by the angle between them, or by a combination of both.
Choosing the metric type comes down to two things. First, the type of data you are working with. Second, the embedding model you are using, which itself should be chosen for that data type, whether it is image, text, or audio (which is why the model is the second consideration).
It is defined during the index setup, and if you get it wrong, you’re essentially speaking a different language than your embedding model. Two vectors might look "close" in a straight line in the vector space, but semantically, they could be unrelated.
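To make "defined during the index setup" concrete, here is a sketch in the pymilvus 2.x style. The collection name and field name are placeholders, not from any real project:

```python
# The metric type is baked into the index parameters at creation time.
# Changing it later means rebuilding the index.
index_params = {
    "metric_type": "COSINE",   # or "L2" / "IP"
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128},
}

# With a live Milvus instance, applying it looks roughly like this:
# from pymilvus import Collection
# collection = Collection("my_docs")               # hypothetical collection
# collection.create_index("embedding", index_params)

print(index_params["metric_type"])
```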
Here is the breakdown of the three main metric types and when to actually use them.
1. Cosine Similarity (The NLP Standard)
What it measures: The angle between two vectors.
What it ignores: Magnitude (length).
For text and NLP, this is usually your default. Modern dense embedding models encode semantic meaning in the direction of the vector, not the length.
Think of it this way:
- Vector A: The sentence "I love cats."
- Vector B: A 500-word essay about how much the author loves cats.
If you plot these, they will point in roughly the same direction (high cosine similarity) because the topic is identical. However, Vector B will likely have a much larger magnitude because it has more words/features. If you used a distance metric that cared about length, these two documents would seem far apart. Cosine ignores the length and focuses purely on the vibe (direction).
Note on Sparse vs. Dense: In older sparse vector models (like TF-IDF), magnitude represented word frequency or importance. In modern dense embeddings (like OpenAI's text-embedding-3), magnitude is often normalized or irrelevant to meaning.
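You can see the magnitude-blindness directly with a toy calculation (the vectors here are made up for illustration, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: the norms in the denominator cancel
    # out any difference in vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 3.0]                 # stand-in for "I love cats."
long_doc = [100 * x for x in short_doc]     # same direction, 100x the length

# Same direction, wildly different magnitude -> similarity is still ~1.0
print(cosine_similarity(short_doc, long_doc))
```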
2. L2 / Euclidean Distance (The Physical Distance)
What it measures: The straight-line distance between points in space.
What it cares about: Both direction and magnitude.
Euclidean distance is what we typically think of when we say "distance" in the real world—measuring with a ruler from point A to point B.
When to use it:
This is heavily used in Computer Vision (images) and Audio processing.
In image data, the "magnitude" often carries valid information, such as pixel intensity, color depth, or the sheer number of features present. If you have a dark image and a bright image of the same object, the pixel values are different, creating a distance in the vector space that L2 captures effectively.
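The brightness example can be sketched in a few lines. The pixel-feature vectors below are invented for illustration:

```python
import math

def euclidean_distance(a, b):
    # Straight-line (L2) distance: sensitive to both direction and magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

dark = [10.0, 12.0, 11.0]        # hypothetical low-intensity pixel features
bright = [200.0, 210.0, 205.0]   # same object, much brighter

# The intensity gap alone produces a large L2 distance, even though the
# relative pattern across the features is similar.
print(euclidean_distance(dark, bright))
```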
3. Inner Product (IP) (The Speed Demon)
What it measures: The projection of one vector onto another.
What it cares about: Direction and magnitude.
This is the standard for Recommendation Systems. In these use cases, magnitude often correlates with the "strength" of a user's preference or the popularity of an item. You want to capture both what they like (direction) and how much they like it (magnitude).
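A tiny sketch shows how magnitude feeds into the score. The user and item vectors are hypothetical:

```python
def inner_product(a, b):
    # Projection-based score: rewards aligned direction AND large magnitude.
    return sum(x * y for x, y in zip(a, b))

mild_fan = [0.5, 0.5]     # hypothetical user with a weak preference
superfan = [5.0, 5.0]     # same taste, much stronger signal
item = [1.0, 1.0]

print(inner_product(mild_fan, item))   # same direction...
print(inner_product(superfan, item))   # ...but the stronger preference scores higher
```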
The "Free" Performance Hack
Here is a trick that is worth knowing if you are optimizing for scale.
If your embedding model outputs normalized vectors (vectors with a length of 1), Inner Product (IP) and Cosine Similarity become mathematically equivalent in terms of ranking order.
Why does this matter?
IP is computationally cheaper to calculate than Cosine.
Cosine requires a division operation to normalize the vectors during the calculation. If you know your vectors are already normalized (which many modern models do by default), you can set your metric type to IP to get the same accuracy as Cosine, but with faster search performance.
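The equivalence is easy to verify yourself. With unit-length vectors, the cosine denominator is 1, so both metrics rank candidates identically (toy vectors below):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (na * nb)

query = normalize([3.0, 4.0])
docs = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]

# On normalized vectors, IP and cosine produce the same ranking order --
# but IP skips the norm computations and the division.
rank_ip = sorted(range(len(docs)), key=lambda i: inner_product(query, docs[i]), reverse=True)
rank_cos = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(rank_ip, rank_cos)
```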
TL;DR
Don't just leave the metric type on default.
- Text/RAG? Check your model documentation. It's usually Cosine.
- Images/Audio? Likely L2 (Euclidean).
- Normalized Embeddings? Use IP for a free speed boost.