Akmal Chaudhri for SingleStore

Quick tip: Visualise OpenAI Vector Embeddings using Plotly Express

Abstract

This article demonstrates how to visualise OpenAI vector embeddings for a search term using t-SNE and Plotly Express. We build on the work from a previous article, where we showed how to adapt an OpenAI example to work with SingleStoreDB. With some minor code modifications, we can use the same example to visualise vector embeddings for a search term.

The notebook file used in this article is available on GitHub.

Introduction

In a great article, the author demonstrates how to visualise vector embeddings using several technologies. We can use a previous OpenAI example dataset, simplify the code, and use Plotly Express to render a similar visualisation. Let's see how.

As described in a previous article, we'll follow the instructions to create a Notebook.

Fill out the Notebook

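First, we'll install the OpenAI library. The code below imports from openai.embeddings_utils, which was removed in version 1.0 of the library, so we pin an earlier release:

!pip install "openai<1" --quiet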

Next, we'll specify our embedding model:

import openai

EMBEDDING_MODEL = "text-embedding-ada-002"

Next, we'll set our OpenAI API Key:

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
openai.api_key = os.environ["OPENAI_API_KEY"]

Now we'll add a few more libraries:

!pip install matplotlib --quiet
!pip install plotly --quiet
!pip install scikit-learn --quiet
!pip install wget --quiet

and imports:

import numpy as np
import pandas as pd
import wget
import ast

We'll now download a CSV file from OpenAI that contains text and embeddings related to the Winter Olympics 2022:

embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"

file_path = "winter_olympics_2022.csv"

if not os.path.exists(file_path):
    wget.download(embeddings_path, file_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")

Now we'll read the file into a DataFrame and convert the embeddings to a NumPy array:

# Reuse the path defined in the download step
df = pd.read_csv(file_path)

# Convert embeddings from CSV str type to NumPy array
embedding_array = np.array(
    df['embedding'].apply(ast.literal_eval).to_list()
)
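To see what this conversion does without the full dataset, here is a minimal sketch on a made-up two-row DataFrame. The column names mirror the CSV, but the values and the three-dimensional vectors are assumptions for illustration only (the real text-embedding-ada-002 vectors have 1,536 dimensions).

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-in for the downloaded CSV: each embedding is stored as a string.
toy_df = pd.DataFrame(
    {
        "text": ["first chunk", "second chunk"],
        "embedding": ["[0.1, 0.2, 0.3]", "[0.4, 0.5, 0.6]"],
    }
)

# literal_eval safely parses each string into a Python list of floats,
# and np.array stacks the lists into a 2-D array (rows x dimensions).
toy_array = np.array(toy_df["embedding"].apply(ast.literal_eval).to_list())

print(toy_array.shape)  # (2, 3)
print(toy_array.dtype)  # float64
```

Each row of the resulting array is one embedding, which is the layout that the distance and t-SNE steps later in the article expect.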

Our search term is "curling gold medal", and we'll get the vector embeddings for this from OpenAI:

from openai.embeddings_utils import get_embedding

query = "curling gold medal"
query_embedding_response = np.array(
    get_embedding(query, EMBEDDING_MODEL)
)

Now we'll compute the Euclidean distance between the search term's embedding and each of the vector embeddings we loaded earlier, and store it in a new column:

from scipy.spatial.distance import cdist

df['distance'] = cdist(
    embedding_array,
    [query_embedding_response]
)
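As a quick check of what cdist returns, the sketch below uses made-up two-dimensional vectors in place of the real embeddings and compares the result with the distance computed directly from its definition. The names toy_embeddings and toy_query are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Made-up 2-D vectors standing in for the real embeddings.
toy_embeddings = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
toy_query = np.array([0.0, 0.0])

# cdist expects two 2-D inputs, hence the list around the query vector.
# The default metric is Euclidean.
toy_distances = cdist(toy_embeddings, [toy_query])

# The same distances computed directly from the definition.
manual = np.linalg.norm(toy_embeddings - toy_query, axis=1)

print(toy_distances.shape)    # (3, 1)
print(toy_distances.ravel())  # distances 0, 5 and sqrt(2)
```

The result has one row per embedding and one column per query vector, which is why it can be stored as a single DataFrame column.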

Next, we'll scale the distances to the range 0 to 1 and store them in a new column:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(df[['distance']])

df['normalised'] = scaler.transform(df[['distance']])
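Min-max scaling maps the smallest distance to 0 and the largest to 1 using (x - min) / (max - min). The following sketch, on made-up distances, shows that MinMaxScaler produces the same values as that formula.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up distances, chosen so the scaled values are easy to check by hand.
toy = pd.DataFrame({"distance": [2.0, 3.0, 6.0]})

# fit_transform combines the separate fit and transform calls.
scaled = MinMaxScaler().fit_transform(toy[["distance"]]).ravel()

# The same scaling written out from the formula.
col = toy["distance"]
manual = ((col - col.min()) / (col.max() - col.min())).to_numpy()

print(scaled)  # values 0, 0.25 and 1
```

Because the scaling is relative to the smallest and largest distances in the dataset, the resulting values are only comparable within one query.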

Finally, we'll create a t-SNE model and plot the data using Plotly Express:

import plotly.express as px
from sklearn.manifold import TSNE

# Create a t-SNE model
tsne_model = TSNE(
    n_components = 2,
    perplexity = 15,
    random_state = 42,
    init = 'random',
    learning_rate = 200
)
tsne_embeddings = tsne_model.fit_transform(embedding_array)

# Create a DataFrame for visualisation
visualisation_data = pd.DataFrame(
    {'x': tsne_embeddings[:, 0],
     'y': tsne_embeddings[:, 1],
     'Similarity': df['normalised']}
)

# Create the scatter plot using Plotly Express
plot = px.scatter(
    visualisation_data,
    x = 'x',
    y = 'y',
    color = 'Similarity',
    color_continuous_scale = 'rainbow',
    opacity = 0.3,
    title = f"Similarity to '{query}' visualised using t-SNE"
)

plot.update_layout(
    width = 650,
    height = 650
)

# Show the plot
plot.show()
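To try the t-SNE settings quickly without the full dataset, a minimal sketch on random vectors may help. The data here is synthetic and the names toy_vectors and toy_points are hypothetical; only the TSNE parameters are taken from the code above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the embeddings: 60 vectors with 16 dimensions.
rng = np.random.default_rng(42)
toy_vectors = rng.normal(size=(60, 16))

# Same settings as in the article; perplexity must stay below the sample count.
toy_tsne = TSNE(
    n_components=2,
    perplexity=15,
    random_state=42,
    init="random",
    learning_rate=200,
)
toy_points = toy_tsne.fit_transform(toy_vectors)

print(toy_points.shape)  # (60, 2)
```

Each input vector is reduced to one (x, y) pair, which is what the scatter plot uses for its axes.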

The output should be as shown in Figure 1.

Figure 1. Similarity to 'curling gold medal' visualised using t-SNE.

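Colours encode the normalised Euclidean distance to the search term, which the plot labels Similarity. Values near 0 mark the text closest to the search term and values near 1 mark the text furthest away. On Plotly's rainbow scale, low values sit at the blue end and high values at the red end, so check the colour bar when reading the plot.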
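The Similarity column in the plot holds the min-max scaled distance, so 0 marks the closest text and 1 the furthest. If you would prefer larger values to mean more similar, one option is to invert the scaled distance before plotting. This is a minimal sketch with made-up values and a hypothetical similarity column, not part of the original notebook.

```python
import pandas as pd

# Hypothetical normalised distances, as produced by min-max scaling.
toy = pd.DataFrame({"normalised": [0.0, 0.25, 1.0]})

# Invert so that 1 means closest to the search term and 0 means furthest.
toy["similarity"] = 1.0 - toy["normalised"]

print(toy["similarity"].tolist())  # [1.0, 0.75, 0.0]
```

Passing the inverted column to the color argument of px.scatter would then make the high end of the colour scale correspond to the closest matches.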

Summary

Data visualisation can be used to gain insights into the distribution of data, as seen in a previous article. In this article, we saw how to use vector embeddings and a search term to create a t-SNE model and visualise it using Plotly Express. This simple example showed how to use data visualisation to identify patterns and trends.
