Akmal Chaudhri for SingleStore

Posted on May 1 • Updated on Nov 1

Quick tip: Write numpy arrays directly to the SingleStore VECTOR data type

#singlestoredb #vectordatabase #numpy

Abstract

The new SingleStore VECTOR data type was introduced in early 2024. It provides a number of benefits over using the BLOB type when working with vector data. Now, numpy arrays can be directly stored in a table with a VECTOR column. In this article, we'll see how.

The notebook file used in this article is available on GitHub.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

Workspace Group Name: Iris Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)
Workspace Name: iris-demo
Size: S-00

Create a new notebook

From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.

Figure 1. New Notebook.

We'll call the notebook numpy_vector_demo, select a Blank notebook template from the available options, and save it in the Personal location.

Fill out the notebook

First, we'll import some libraries:

import pandas as pd
import numpy as np

Next, we'll load the iris dataset:

url = "https://gist.githubusercontent.com/VeryFatBoy/9af771d443f5ec4dd6eec8d69a062638/raw/c03ef25a97f23a48ee408ac02114195b663a2364/iris.csv"

df = pd.read_csv(url)

and print the first few rows:

df.head()

Example output:

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

We'll now create a copy of the original DataFrame:

str_df = df.copy()

and then combine the first four columns into a new vector column in a string format and then drop the first four columns, as follows:

str_df["vector"] = str_df.apply(lambda row: f"[{row['sepal_length']}, {row['sepal_width']}, {row['petal_length']}, {row['petal_width']}]", axis = 1)

str_df.drop(columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"], inplace = True)

We now have two columns in our DataFrame.

We can also represent the data using numpy arrays instead of strings.

First, we'll create another copy of our original DataFrame:

num_df = df.copy()

and then combine the first four columns into a new vector column in a numpy 32-bit format and then drop the first four columns, as follows:

num_df["vector"] = num_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].apply(lambda row: np.array(row).astype(np.float32), axis = 1)

num_df.drop(columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"], inplace = True)

We'll now create our database and two tables, as follows:

DROP DATABASE IF EXISTS iris_db;
CREATE DATABASE IF NOT EXISTS iris_db;

USE iris_db;

DROP TABLE IF EXISTS iris_str;
CREATE TABLE IF NOT EXISTS iris_str (
    species VARCHAR(20),
    vector VECTOR(4) NOT NULL
);

DROP TABLE IF EXISTS iris_num;
CREATE TABLE IF NOT EXISTS iris_num (
    species VARCHAR(20),
    vector VECTOR(4) NOT NULL
);

We have one table where we will store the vectors using the string format and another table where we will store the vectors using the numpy format.

We'll now create the connection to SingleStore:

from sqlalchemy import *

db_connection = create_engine(connection_url)

and write the first DataFrame using the string format:

str_df.to_sql(
    "iris_str",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)

and then the second DataFrame using the numpy format:

num_df.to_sql(
    "iris_num",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)

Using "append" ensures that the table structures we previously defined are preserved.

Both methods enable us to write the original dataset into SingleStore. With support for writing vector data in multiple formats, we can choose the best approach for our use case.

Summary

Besides the existing support for writing vectors as strings, numpy arrays can now be directly written to the VECTOR data type in SingleStore.

DEV Community

Quick tip: Write numpy arrays directly to the SingleStore VECTOR data type

Abstract

Create a SingleStoreDB Cloud account

Create a new notebook

Fill out the notebook

Summary

Top comments (0)

Read next

Sniffing Out Code Smells and Design Smells: A Guide to Cleaner Code | Web Theory: Part 13

My minimal VS Code setup for Vue, Typescript, Tailwind CSS, etc

AIM Weekly for 04Nov2024

How to Safely Delete Local and Remote Branches in Git