DEV Community

Akmal Chaudhri for SingleStore

Posted on • Updated on

Quick tip: Dot Product, Euclidean Distance and Cosine Similarity in SingleStoreDB

Abstract

SingleStoreDB has supported vector functions since 2017. In this article, we'll see examples of using Dot Product, Euclidean Distance and Cosine Similarity in SingleStoreDB to assess vector relationships. New vector capabilities will be available in the next major release of SingleStoreDB - stay tuned.

Introduction

Dot Product

The Dot Product is a mathematical procedure that combines two lists of numbers into a single value. This is achieved by multiplying corresponding numbers from each list and then adding up the products. This operation is useful for tasks like measuring the commonality between two sets of values or assessing the influence of one set on another.

Euclidean Distance

Euclidean Distance is a mathematical metric that measures the straight-line distance between two points in a Cartesian coordinate system. To compute it, square the differences between the x and y coordinates of the two points, sum these squared differences, and then take the square root of the result. The final output is a single numerical representation of the spatial separation between the points, like the direct walking distance between them on a graph.

Cosine Similarity

Cosine Similarity is a mathematical measure that gauges the similarity between two sets, such as two lists of words. Instead of focusing on the specific words, it considers the angle between the sets, irrespective of their sizes. Visualise these sets as vectors in a multi-dimensional space. Cosine Similarity essentially assesses how well the directions of these vectors align. High similarity arises when the vectors point in the same direction, while low similarity occurs when they are perpendicular. This method is commonly employed in comparing documents or texts, providing insights into whether they discuss similar topics, regardless of the specific wording.

SingleStoreDB

SingleStoreDB provides direct support for Dot Product and Euclidean Distance using the vector functions DOT_PRODUCT and EUCLIDEAN_DISTANCE, respectively. Cosine Similarity is supported by combining the DOT_PRODUCT and SQRT functions. In this article, we'll see some simple examples of each function.

For further details, see:

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: Iris Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: iris-demo
  • Size: S-00

Create a Database and Table

In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this iris_db, as follows:

CREATE DATABASE IF NOT EXISTS iris_db;
Enter fullscreen mode Exit fullscreen mode

We'll also create the iris table, as follows:

USE iris_db;

CREATE TABLE IF NOT EXISTS iris (
     vector BLOB,
     species VARCHAR(20)
);
Enter fullscreen mode Exit fullscreen mode

The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: sepal_length, sepal_width, petal_length and petal_width. We can store these four column values together using JSON_ARRAY_PACK in a BLOB format, as follows:

INSERT INTO iris VALUES
(JSON_ARRAY_PACK('[5.1,3.5,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.7,3.2,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.1,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.6,1.4,0.2]'),'Iris-setosa'),
...
(JSON_ARRAY_PACK('[6.7,3,5.2,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.5,5,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.2,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.2,3.4,5.4,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'),'Iris-virginica');
Enter fullscreen mode Exit fullscreen mode

Only the first five and last five rows are shown above. The complete INSERT code listing is available in a GitHub Gist.

Query 1

First, let's try a query where we want to find the name of the flower species using an exact match for the sepal_length, sepal_width, petal_length and petal_width. We'll use the values [5.9,3,5.1,1.8] from the last row of the iris table, shown above.

Dot Product

SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = DOT_PRODUCT(vector, vector);
Enter fullscreen mode Exit fullscreen mode

This SQL query retrieves the species from the iris table for rows where the Dot Product of the vector in the vector column with the specified vector [5.9,3,5.1,1.8] is equal to the Dot Product of the vector column with itself. Since there's only one row with the specified values in the vector column, the query essentially identifies the species for that particular set of values in the iris table.

Euclidean Distance

SELECT species
FROM iris
WHERE EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = 0;
Enter fullscreen mode Exit fullscreen mode

This SQL query retrieves the species from the iris table for rows where the Euclidean Distance between the vector in the vector column and the specified vector [5.9,3,5.1,1.8] is equal to 0. In other words, they are one and the same.

Cosine Similarity

SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'), JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'))) = 1;
Enter fullscreen mode Exit fullscreen mode

This SQL query retrieves the species values from the iris table for rows where the normalised Dot Product of the vector in the vector column with the specified vector [5.9,3,5.1,1.8] equals 1. The condition checks for parallel vectors, indicating a high similarity or identical direction between the two vectors.

The result in each case should be:

+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+
Enter fullscreen mode Exit fullscreen mode

Query 2

Now, let's use some fictitious data values [5.2,3.6,1.5,0.3] to make a prediction.

Dot Product

SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) DESC
LIMIT 1;
Enter fullscreen mode Exit fullscreen mode

This SQL query retrieves the species from the iris table and orders the results in descending order based on the Dot Product between the vectors in the vector column and the specified vector [5.2,3.6,1.5,0.3]. The LIMIT 1 ensures that only the top result is returned.

The result should be:

+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+
Enter fullscreen mode Exit fullscreen mode

Euclidean Distance

SELECT species
FROM iris
ORDER BY EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))
LIMIT 1;
Enter fullscreen mode Exit fullscreen mode

This SQL query retrieves the species from the iris table and orders the results based on the Euclidean Distance between the vectors in the vector column and the specified vector [5.2,3.6,1.5,0.3]. The LIMIT 1 ensures that only the top result, closest in Euclidean Distance, is returned.

The result should be:

+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+
Enter fullscreen mode Exit fullscreen mode

Cosine Similarity

If we want to use DOT_PRODUCT and achieve a similar result to EUCLIDEAN_DISTANCE, we can use the following approach. The DOT_PRODUCT doesn't directly represent distance, so we need to modify the query. One possible adjustment is to consider the angle between vectors.

SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'), JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))) DESC
LIMIT 1;
Enter fullscreen mode Exit fullscreen mode

This SQL query normalises the DOT_PRODUCT by dividing it by the product of the magnitudes of the vectors. This adjustment helps to account for differences in vector magnitudes and makes the DOT_PRODUCT more comparable to a Cosine Similarity, which considers the angle between vectors. The ordering is done in descending order to prioritise higher similarity.

The result should be:

+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+
Enter fullscreen mode Exit fullscreen mode

Summary

In this short article, we've seen several examples of how to use SingleStoreDB's built-in DOT_PRODUCT and EUCLIDEAN_DISTANCE vector functions. We've also seen how we can easily represent Cosine Similarity by combining the DOT_PRODUCT and SQRT functions.

We've used the functions to find an exact match and made predictions based on new, previously unknown, values.

Top comments (0)