Abstract
This short article will demonstrate another example of working with OpenAI and storing and querying vectors in SingleStoreDB. In a previous article, we saw that SingleStoreDB was adept at working with vectors through its vector functions, using a local installation. This time, however, we'll use SingleStoreDB in the cloud. In a future article, we'll use Docker.
Introduction
In a great article, the author shows how to create vectors for a clustering task. The vectors are eventually stored and queried using a vector database. However, we could store and query the vectors using SingleStoreDB instead. Since the original article code is copyrighted, we'll just provide the additional code required to use SingleStoreDB, instead of a vector database. There is, however, one code correction required in the original article. The following line:
matrix = np.array(df.ada_similarity.apply(eval).to_list())
needs to be changed to:
matrix = np.array(df.ada_similarity.to_list())
Create a SingleStoreDB Cloud account
A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use OpenAI Demo Group as our Workspace Group Name and openai-demo as our Workspace Name. We'll make a note of our password and host name.
We'll use the SQL Editor to create a new database, as follows:
CREATE DATABASE IF NOT EXISTS openai_demo;
Fill out the Notebook
In a Jupyter notebook, at a minimum, we'll need to add the following library:
!pip install singlestoredb --quiet
Next, we'll create a connection to SingleStoreDB, as follows:
import singlestoredb as s2
conn = s2.connect("admin:<password>@<host>:3306/openai_demo")
cur = conn.cursor()
We'll replace the <password>
and <host>
with the values from our SingleStoreDB Cloud account.
We could store all the Dataframe data in SingleStoreDB, as shown in the previous article, but we'll limit it this time to several columns. Our code below will be similar to the previous article.
First, we'll create the table:
stmt = """
CREATE TABLE IF NOT EXISTS fine_food_reviews (
id INT PRIMARY KEY,
combined TEXT,
ada_similarity BLOB
)
"""
cur.execute(stmt)
Next, we'll insert the data into the table, as follows:
# Prepare the statement
stmt = """
INSERT INTO fine_food_reviews (
id,
combined,
ada_similarity
)
VALUES (
%s,
%s,
JSON_ARRAY_PACK_F64(%s)
)
"""
# Convert the DataFrame to a NumPy record array
record_arr = df.to_records(index=True)
# Set the batch size
batch_size = 100
# Iterate over the rows of the record array in batches
for i in range(0, len(record_arr), batch_size):
batch = record_arr[i:i+batch_size]
values = [(
row[0],
row[6],
str(row[8])
) for row in batch]
cur.executemany(stmt, values)
We can also use JSON_ARRAY_PACK_F32
(32-bit, IEEE standard format), instead of JSON_ARRAY_PACK_F64
(64-bit, IEEE standard format).
We'll now create a Python function that will call OpenAI and query SingleStoreDB, returning the values
and scores
, as follows:
from typing import Tuple, List
def search_reviews(
query: str,
num_rows: int = 10
) -> Tuple[List[str], List[float]]:
"""Searches Fine Foods Reviews for the given query and returns the top `num_rows` results.
Args:
query: The query to search for.
num_rows: The number of results to return.
Returns:
A list of the top `num_rows` results.
"""
# Get the embedding of the query
query_embedding_response = openai.Embedding.create(
model='text-embedding-ada-002',
input=query,
)
query_embedding = query_embedding_response["data"][0]["embedding"]
# Create the SQL statement
stmt = """
SELECT
combined,
DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), ada_similarity) AS score
FROM fine_food_reviews
ORDER BY score DESC
LIMIT %s
"""
# Execute the SQL statement
cur.execute(stmt, [str(query_embedding), num_rows])
# Get the results
results = cur.fetchall()
# Separate the results into two lists
values = [row[0] for row in results]
scores = [row[1] for row in results]
# Return the results
return values, scores
Now we are ready to query the data:
values, scores = search_reviews(
query = "I have ordered these raisins",
num_rows = 5
)
We can tabulate the results as follows:
from tabulate import tabulate
# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data = list(zip([value[:50] for value in values], scores))
# Add a rank column to the table data
table_data = [(i + 1,) + data for i, data in enumerate(table_data)]
# Create the table
table = tabulate(table_data, headers=["Rank", "Combined", "Score"])
# Print the table
print(table)
The output should be similar to the following:
Rank Combined Score
------ -------------------------------------------------- --------
1 Title: Delicious!; Content: I have ordered these r 0.880185
2 Title: Just what I expected!; Content: I bought th 0.838868
3 Title: its delicious...; Content: I was surprised 0.820137
4 Title: Perfect Gift; Content: I got these to give 0.814143
5 Title: a great product and deal; Content: I looked 0.811726
A quick visual inspection shows that the results are similar to those in the original article.
Summary
In this short article, we have seen another example where SingleStoreDB can store and query vectors. In future articles, we'll look at further examples and scenarios where we might wish to perform additional queries on the database beyond just querying the vector data.
Top comments (0)