DEV Community

Akmal Chaudhri for SingleStore

Posted on

Quick tip: Write numpy arrays directly to the SingleStore VECTOR data type

Abstract

The new SingleStore VECTOR data type was introduced in early 2024. It provides a number of benefits over using the BLOB type when working with vector data. Now, numpy arrays can be directly stored in a table with a VECTOR column. In this article, we'll see how.

The notebook file used in this article is available on GitHub.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: Iris Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: iris-demo
  • Size: S-00

Create a new notebook

From the left navigation pane in the cloud portal, we'll select Develop > Notebooks.

In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.

Figure 1. New Notebook.

Figure 1. New Notebook.

We'll call the notebook numpy_vector_demo, select a Blank notebook template from the available options, and save it in the Personal location.

Fill out the notebook

First, we'll import some libraries:

import pandas as pd
import numpy as np
Enter fullscreen mode Exit fullscreen mode

Next, we'll load the iris dataset:

url = "https://gist.githubusercontent.com/VeryFatBoy/9af771d443f5ec4dd6eec8d69a062638/raw/c03ef25a97f23a48ee408ac02114195b663a2364/iris.csv"

df = pd.read_csv(url)
Enter fullscreen mode Exit fullscreen mode

and print the first few rows:

df.head()
Enter fullscreen mode Exit fullscreen mode

Example output:

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Enter fullscreen mode Exit fullscreen mode

We'll now create a copy of the original DataFrame:

str_df = df.copy()
Enter fullscreen mode Exit fullscreen mode

and then combine the first four columns into a new vector column in a string format and then drop the first four columns, as follows:

str_df["vector"] = str_df.apply(lambda row: f"[{row['sepal_length']}, {row['sepal_width']}, {row['petal_length']}, {row['petal_width']}]", axis = 1)

str_df.drop(columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"], inplace = True)
Enter fullscreen mode Exit fullscreen mode

We now have two columns in our DataFrame.

We can also represent the data using numpy arrays instead of strings.

First, we'll create another copy of our original DataFrame:

num_df = df.copy()
Enter fullscreen mode Exit fullscreen mode

and then combine the first four columns into a new vector column in a numpy 32-bit format and then drop the first four columns, as follows:

num_df["vector"] = num_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].apply(lambda row: np.array(row).astype(np.float32), axis = 1)

num_df.drop(columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"], inplace = True)
Enter fullscreen mode Exit fullscreen mode

We'll now create our database and two tables, as follows:

DROP DATABASE IF EXISTS iris_demo;
CREATE DATABASE IF NOT EXISTS iris_demo;

USE iris_demo;

DROP TABLE IF EXISTS iris_str;
CREATE TABLE IF NOT EXISTS iris_str (
     species VARCHAR(20),
     vector VECTOR(4) NOT NULL
);

DROP TABLE IF EXISTS iris_num;
CREATE TABLE IF NOT EXISTS iris_num (
     species VARCHAR(20),
     vector VECTOR(4) NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

We have one table where we will store the vectors using the string format and another table where we will store the vectors using the numpy format.

We'll now create the connection to SingleStore:

from sqlalchemy import *

db_connection = create_engine(connection_url)
Enter fullscreen mode Exit fullscreen mode

and write the first DataFrame using the string format:

str_df.to_sql(
    "iris_str",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)
Enter fullscreen mode Exit fullscreen mode

and then the second DataFrame using the numpy format:

num_df.to_sql(
    "iris_num",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)
Enter fullscreen mode Exit fullscreen mode

Using "append" ensures that the table structures we previously defined are preserved.

Both methods enable us to write the original dataset into SingleStore. With support for writing vector data in multiple formats, we can choose the best approach for our use case.

Summary

Besides the existing support for writing vectors as strings, numpy arrays can now be directly written to the VECTOR data type in SingleStore.

Top comments (0)