Abstract
In a previous article, we saw how to represent Cosine Similarity in SingleStore by combining the DOT_PRODUCT and SQRT functions. The SingleStore documentation also provides a way to implement a COSINE_SIMILARITY function. In this article, we'll see how.
Create a SingleStoreDB Cloud account
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:
- Workspace Group Name: Iris Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: iris-demo
- Size: S-00
Create a Database and Table
In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this iris_db, as follows:
CREATE DATABASE IF NOT EXISTS iris_db;
We'll also create the iris table using the new VECTOR data type, as follows:
USE iris_db;
CREATE TABLE IF NOT EXISTS iris (
    vector VECTOR(4),
    species VARCHAR(20)
);
The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: sepal_length, sepal_width, petal_length and petal_width. We can store these four column values together, as follows:
INSERT INTO iris VALUES
('[5.1,3.5,1.4,0.2]','Iris-setosa'),
('[4.9,3,1.4,0.2]','Iris-setosa'),
('[4.7,3.2,1.3,0.2]','Iris-setosa'),
('[4.6,3.1,1.5,0.2]','Iris-setosa'),
('[5,3.6,1.4,0.2]','Iris-setosa'),
...
('[6.7,3,5.2,2.3]','Iris-virginica'),
('[6.3,2.5,5,1.9]','Iris-virginica'),
('[6.5,3,5.2,2]','Iris-virginica'),
('[6.2,3.4,5.4,2.3]','Iris-virginica'),
('[5.9,3,5.1,1.8]','Iris-virginica');
Only the first five and last five rows are shown above. The complete INSERT code listing is available in a GitHub Gist.
Create a Cosine Similarity Function
From the product documentation, we'll now define two functions that we can use:
DELIMITER //
CREATE OR REPLACE FUNCTION NORMALIZE(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE
    squares VECTOR(4) = VECTOR_MUL(v, v);
    length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));
BEGIN
    RETURN SCALAR_VECTOR_MUL(1/length, v);
END //
DELIMITER ;
and
DELIMITER //
CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(v1 VECTOR(4), v2 VECTOR(4)) RETURNS FLOAT AS
BEGIN
    RETURN DOT_PRODUCT(NORMALIZE(v1), NORMALIZE(v2));
END //
DELIMITER ;
In the SingleStore notebook environment, the code would be as follows:
CREATE OR REPLACE FUNCTION NORMALIZE(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE
    squares VECTOR(4) = VECTOR_MUL(v, v);
    length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));
BEGIN
    RETURN SCALAR_VECTOR_MUL(1/length, v);
END;
and
CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(v1 VECTOR(4), v2 VECTOR(4)) RETURNS FLOAT AS
BEGIN
    RETURN DOT_PRODUCT(NORMALIZE(v1), NORMALIZE(v2));
END;
In other words, remove the references to DELIMITER and //.
Query 1
First, let's try a query where we want to find the name of the flower species using an exact match for the sepal_length, sepal_width, petal_length and petal_width. We'll use the values [5.9,3,5.1,1.8] from the last row of the iris table, shown above.
SELECT species
FROM iris
ORDER BY COSINE_SIMILARITY(vector, '[5.9,3,5.1,1.8]') DESC
LIMIT 1;
The result should be:
+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+
Query 2
Now, let's use some fictitious data values [5.2,3.6,1.5,0.3] to make a prediction.
SELECT species
FROM iris
ORDER BY COSINE_SIMILARITY(vector, '[5.2,3.6,1.5,0.3]') DESC
LIMIT 1;
The result should be:
+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+
Cleanup:
DROP TABLE IF EXISTS iris;
DROP DATABASE IF EXISTS iris_db;
Summary
SingleStore provides direct support for the DOT_PRODUCT and EUCLIDEAN_DISTANCE functions. We can easily implement our own COSINE_SIMILARITY function as shown in this example.
 
 
              
 
                       
    
Top comments (0)