DEV Community

Akmal Chaudhri for SingleStore

Posted on • Updated on

Quick tip: Using OpenAI with SingleStoreDB to store and query vectors of Fine Food Reviews

Abstract

This short article will demonstrate another example of working with OpenAI and storing and querying vectors in SingleStoreDB. In a previous article, we saw that SingleStoreDB was adept at working with vectors through its vector functions, using a local installation. This time, however, we'll use SingleStoreDB in the cloud. In a future article, we'll use Docker.

Introduction

In a great article, the author shows how to create vectors for a clustering task. The vectors are eventually stored and queried using a vector database. However, we could store and query the vectors using SingleStoreDB instead. Since the original article code is copyrighted, we'll just provide the additional code required to use SingleStoreDB, instead of a vector database. There is, however, one code correction required in the original article. The following line:

matrix = np.array(df.ada_similarity.apply(eval).to_list())
Enter fullscreen mode Exit fullscreen mode

needs to be changed to:

matrix = np.array(df.ada_similarity.to_list())
Enter fullscreen mode Exit fullscreen mode

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use OpenAI Demo Group as our Workspace Group Name and openai-demo as our Workspace Name. We'll make a note of our password and host name.

We'll use the SQL Editor to create a new database, as follows:

CREATE DATABASE IF NOT EXISTS openai_demo;
Enter fullscreen mode Exit fullscreen mode

Fill out the Notebook

In a Jupyter notebook, at a minimum, we'll need to add the following library:

!pip install singlestoredb --quiet
Enter fullscreen mode Exit fullscreen mode

Next, we'll create a connection to SingleStoreDB, as follows:

import singlestoredb as s2

conn = s2.connect("admin:<password>@<host>:3306/openai_demo")

cur = conn.cursor()
Enter fullscreen mode Exit fullscreen mode

We'll replace the <password> and <host> with the values from our SingleStoreDB Cloud account.

We could store all the Dataframe data in SingleStoreDB, as shown in the previous article, but we'll limit it this time to several columns. Our code below will be similar to the previous article.

First, we'll create the table:

stmt = """
    CREATE TABLE IF NOT EXISTS fine_food_reviews (
        id INT PRIMARY KEY,
        combined TEXT,
        ada_similarity BLOB
    )
"""

cur.execute(stmt)
Enter fullscreen mode Exit fullscreen mode

Next, we'll insert the data into the table, as follows:

# Prepare the statement
stmt = """
    INSERT INTO fine_food_reviews (
        id,
        combined,
        ada_similarity
    )
    VALUES (
        %s,
        %s,
        JSON_ARRAY_PACK_F64(%s)
    )
"""

# Convert the DataFrame to a NumPy record array
record_arr = df.to_records(index=True)

# Set the batch size
batch_size = 100

# Iterate over the rows of the record array in batches
for i in range(0, len(record_arr), batch_size):
    batch = record_arr[i:i+batch_size]
    values = [(
        row[0],
        row[6],
        str(row[8])
    ) for row in batch]
    cur.executemany(stmt, values)
Enter fullscreen mode Exit fullscreen mode

We can also use JSON_ARRAY_PACK_F32 (32-bit, IEEE standard format), instead of JSON_ARRAY_PACK_F64 (64-bit, IEEE standard format).

We'll now create a Python function that will call OpenAI and query SingleStoreDB, returning the values and scores, as follows:

from typing import Tuple, List

def search_reviews(
    query: str,
    num_rows: int = 10
) -> Tuple[List[str], List[float]]:
    """Searches Fine Foods Reviews for the given query and returns the top `num_rows` results.

    Args:
        query: The query to search for.
        num_rows: The number of results to return.

    Returns:
        A list of the top `num_rows` results.
    """

    # Get the embedding of the query
    query_embedding_response = openai.Embedding.create(
        model='text-embedding-ada-002',
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]

    # Create the SQL statement
    stmt = """
        SELECT
            combined,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), ada_similarity) AS score
        FROM fine_food_reviews
        ORDER BY score DESC
        LIMIT %s
    """

    # Execute the SQL statement
    cur.execute(stmt, [str(query_embedding), num_rows])

    # Get the results
    results = cur.fetchall()

    # Separate the results into two lists
    values = [row[0] for row in results]
    scores = [row[1] for row in results]

    # Return the results
    return values, scores
Enter fullscreen mode Exit fullscreen mode

Now we are ready to query the data:

values, scores = search_reviews(
    query = "I have ordered these raisins",
    num_rows = 5
)
Enter fullscreen mode Exit fullscreen mode

We can tabulate the results as follows:

from tabulate import tabulate

# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data = list(zip([value[:50] for value in values], scores))

# Add a rank column to the table data
table_data = [(i + 1,) + data for i, data in enumerate(table_data)]

# Create the table
table = tabulate(table_data, headers=["Rank", "Combined", "Score"])

# Print the table
print(table)
Enter fullscreen mode Exit fullscreen mode

The output should be similar to the following:

Rank    Combined                                               Score
------  --------------------------------------------------  --------
     1  Title: Delicious!; Content: I have ordered these r  0.880185
     2  Title: Just what I expected!; Content: I bought th  0.838868
     3  Title: its delicious...; Content: I was surprised   0.820137
     4  Title: Perfect Gift; Content: I got these to give   0.814143
     5  Title: a great product and deal; Content: I looked  0.811726
Enter fullscreen mode Exit fullscreen mode

A quick visual inspection shows that the results are similar to those in the original article.

Summary

In this short article, we have seen another example where SingleStoreDB can store and query vectors. In future articles, we'll look at further examples and scenarios where we might wish to perform additional queries on the database beyond just querying the vector data.

Top comments (0)