DEV Community

Cover image for What is Vector Similarity Search?
Pavan Belagatti
Pavan Belagatti

Posted on • Originally published at singlestore.com

What is Vector Similarity Search?

In the multifaceted world of generative AI, data science, machine learning and analytics, understanding the distance and relationship between data points is crucial. This is how Large Language Models (LLMs) understand the context and retrieve the data. Large amounts of unstructured data can be easily stored in a vector database that is represented in a three-dimensional space.

Whenever a user comes up with a query, the most approximate answer is retrieved from the database. It is amazing to see how the data is stored and how this calculation retrieves particular information.

There are many advanced ways to calculate the distance between the vectors. In this article, we are going to explore some important distance metrics like Manhattan distance, Euclidean distance, Cosine distance and dot product.

What are vectors in the context of machine learning?

vector

In the realm of generative AI, vectors play a crucial role as a means of representing and manipulating complex data. Within this context, vectors are often high-dimensional arrays of numbers that encode significant amounts of information. For instance, in the case of image generation, each image can be converted into a vector representing its pixel values or more abstract features extracted through deep learning models.

These vectors become the language through which AI algorithms understand and generate new content. By navigating and modifying these vectors in a multidimensional space, generative AI produces new, synthetic instances of data — whether images, sounds or text — that mimic the characteristics of the original dataset. This vector manipulation is at the heart of AI's ability to learn from data and generate realistic outputs based on that learning.

What is vector similarity search?

Queries involve finding the nearest neighbors to a given vector in the high-dimensional space. This process, known as vector similarity search or Approximate Nearest Neighbor (ANN) search, looks for vectors that are closest in terms of distance (e.g., Euclidean distance or Cosine similarity) to the query vector. This method captures the similarity in context or features between different data points, crucial for tasks like recommendation systems, similarity searches and clustering.

vector search

Consider the unstructured data from various sources as shown in the previous image. The data gets through an embedding model (from OpenAI, Cohere or HuggingFace, etc.) to convert the content into small word chunks (the vector embeddings). Each word chunk is assigned with a numerical value and this process is known as tokenization. After the vector embeddings are created, they are represented in a three-dimensional space inside the vector database where all this vector data is stored.

In our case, as shown in the preceding image, the unstructured data ‘cat’ and ‘dog’ has been converted into vector data first, then represented in a three-dimensional space closer to each other since they both belong to the same pets category. If we consider one more example of a ‘car’ here, then it gets a representation far away from the ‘cat’ and ‘dog’ since it doesn’t fall under the pets category.

This way the similar objects are placed together and when a user query comes in, the context is considered and a more similar object is retrieved back to the user.

How vector similarity search is calculated

There are different techniques to calculate the distance between the vectors. Dot product, Euclidean distance, Manhattan distance and cosine distance are all fundamental concepts used in vector similarity search. Each one measures the similarity between two vectors in a multi-dimensional space.

Manhattan distance

manhattan distance

Manhattan distance is a way of calculating the distance between two points (vectors) by summing the absolute differences of their coordinates. Imagine navigating a city laid out in a perfect grid: the Manhattan distance between two points is the total number of blocks you'd have to travel vertically and horizontally to get from one point to the other, without the possibility of taking a diagonal shortcut. In the context of vectors, it's similar: you calculate the distance by adding up the absolute differences for each corresponding component of the vectors.

Euclidean distance

euclidean distance

Euclidean distance, often known as the L2 norm, is the most direct way of measuring the distance between two points or vectors, resembling the way we usually think about distance in the physical world. Imagine drawing a straight line between two points on a map; the Euclidean distance is the length of this line.

This metric is widely used in many fields including physics for measuring actual distances, machine learning algorithms for tasks like clustering and classification, and in everyday scenarios whenever a direct or as-the-crow-flies distance needs to be determined. Its natural alignment with our intuitive understanding of distance makes it a fundamental tool in data analysis and geometry.

Cosine distance

cosine

Cosine distance is a measure of similarity between two non-zero vectors that evaluates the cosine of the angle between them. It's not a 'distance' in the traditional sense, but rather a metric that determines how vectors are oriented relative to each other, regardless of their magnitude. Picture two arrows starting from the same point; the smaller the angle between them, the more similar they are in direction.

This measure is particularly useful in fields like text analysis and information retrieval, where the orientation of the vectors (representing, for example, documents or queries in a high-dimensional space) matters more than their absolute positions or magnitudes.

Dot product

cosine

The dot product captures the relationship between two vectors. Imagine two arrows extending from the same starting point; the dot product quantifies how much one arrow aligns with the direction of the other. Mathematically, it's calculated by multiplying corresponding components of the vectors and then summing these products.

A key characteristic of the dot product is that it's large (positive or negative) when vectors point in similar or opposite directions, and small (or zero) when the vectors are perpendicular. This makes the dot product extremely useful in various applications, like determining if two vectors are orthogonal, calculating the angle between vectors in space or in more complex physics and engineering operations like computing work or torque. In machine learning and data science, the dot product is instrumental in algorithms like neural networks, where it helps in calculating the weighted sum of inputs.

Tutorial

We will use SingleStore Notebooks to perform this tutorial. If you haven’t already, activate your free SingleStore trial to get started. Let’s dive deeper into the previously mentioned approaches through a hands-on tutorial.

Let's take an example of two pets (cat and dog) and visualize them in a 3D space. We will try to find the Manhattan distance, Euclidean distance, Cosine distance and dot product between these two pets.

pets

Once you sign up, click on the Notebooks tab and create a blank Notebook with a name.

singlestore

singlestore notebook

You can start running the code in your newly created Notebook. First, install and then import the required libraries.

!pip install numpy
!pip install matplotlib

import matplotlib.pyplot as plt
import numpy as np
Enter fullscreen mode Exit fullscreen mode

Define attributes of the pets that can be represented in three dimensions. Since pets are complex entities with many characteristics, you'll need to simplify this to a 3D representation. Here's an example:

from mpl_toolkits.mplot3d import Axes3D

# Example pets attributes: [weight, height, age]
# These are hypothetical numbers for illustration purposes
dog = [5, 30, 2]
cat = [3, 25, 4]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plotting the pets
ax.scatter(dog[0], dog[1], dog[2], label="Dog", c='blue')
ax.scatter(cat[0], cat[1], cat[2], label="Cat", c='green')

# Drawing lines from the origin to the points
ax.quiver(0, 0, 0, dog[0], dog[1], dog[2], color='blue',
arrow_length_ratio=0.1)
ax.quiver(0, 0, 0, cat[0], cat[1], cat[2], color='green', 
arrow_length_ratio=0.1)

# Labeling the axes
ax.set_xlabel('Weight (kg)')
ax.set_ylabel('Height (cm)')
ax.set_zlabel('Age (years)')

# Setting the limits for better visualization
ax.set_xlim(0, 10)
ax.set_ylim(0, 40)
ax.set_zlim(0, 5)

# Adding legend and title
ax.legend()
ax.set_title('3D Representation of Pets')

plt.show()
Enter fullscreen mode Exit fullscreen mode

Upon executing the code, you should be able to see the representation of pets mentioned in this 3D space.

pets represented

Now, let’s calculate the distance between the vectors using various techniques.

Manhattan distance

L1 = [abs(dog[i] - cat[i]) for i in range(len(dog))]
manhattan_distance = sum(L1)

print("Manhattan Distance:", manhattan_distance)
Enter fullscreen mode Exit fullscreen mode

Euclidean distance

L2 = [(dog[i] - cat[i])**2 for i in range(len(dog))]

L2 = np.sqrt(np.array(L2).sum())

print(L2)
Enter fullscreen mode Exit fullscreen mode

Cosine distance

cosine = np.dot(dog, cat) / (np.linalg.norm(dog) * 
np.linalg.norm(cat))
print(cosine)
Enter fullscreen mode Exit fullscreen mode

Dot product

np.dot(dog,cat)
Enter fullscreen mode Exit fullscreen mode

The complete code of this tutorial can be found here in this Github repository: Representing Unstructured Data as Vectors.

Activate your free SingleStore trial to get started with the above tutorial.

Top comments (5)

Collapse
 
nathan_tarbert profile image
Nathan Tarbert

Great article Pavan on VSS

Collapse
 
pavanbelagatti profile image
Pavan Belagatti

Thanks Nathan:)

Collapse
 
programcrafter profile image
ProgramCrafter

That's cool way to compare data! Though, are there any local models which generate image embeddings?

Collapse
 
ranjancse profile image
Ranjan Dailata • Edited

Please take a look into this blog post - visual-similarity-search

They are making use of img2vec

Collapse
 
pavanbelagatti profile image
Pavan Belagatti

That's a good question. Let me search and see if there are any local models that help with image embeddings.