Leveraging Vector Embeddings and Similarity Search to Supplement ChatGPT’s Training Data

Cole Gawin

Since the release of OpenAI’s ChatGPT, and later its REST API, the technology world has been captivated by the vast capabilities of this paradigm-shifting technology and has been pushing it to its limits. However, for all of its strengths, ChatGPT has two pronounced weaknesses: 1) its training data cutoff is 2021, meaning it has no knowledge of anything that happened more recently, and 2) it is prone to hallucinations, i.e. making up information.

To circumvent this issue, software developers and prompt engineers alike have taken on the task of supplementing ChatGPT’s knowledge base with additional information that can provide more relevant, reliable, and informed responses. This process involves a combination of technologies and prompt engineering that we’ll discuss in depth in this series.

In this article, we'll explore how we can transform textual data into vector embeddings and utilize similarity search to find related documents that can help ChatGPT answer questions with reliable, domain-specific information. We generated a databank of documents from Nourish by WebMD in the previous article in this series—we will use that databank as the documents for our vector embeddings and similarity search.


This is part two of my series on creating a nutrition chatbot powered by data from WebMD using sentence-transformers, FAISS, and ChatGPT.

Part 1: Webscraping Techniques to Source Reliable Information for ChatGPT

Part 2: Leveraging Vector Embeddings and Similarity Search to Supplement ChatGPT’s Training Data

Part 3: ChatGPT Prompt-Engineering Techniques for Providing Contextual Information (coming soon)


Vector embeddings are mathematical representations of words or phrases in a high-dimensional vector space. These embeddings encode both syntactic and semantic relationships between words and phrases, and the importance of this nuance cannot be overstated. Thanks to vector embeddings, we are essentially able to represent the meaning of a text snippet with numbers, which allows us to perform mathematical operations on them.

One of those mathematical operations is similarity search, which compares the distance between two vectors to determine how "similar" they are. For example, if you were to create vector embeddings for the words "dog", "cat", and "house", the "dog" and "cat" vectors would most likely have the least distance and therefore be the most similar since they are most closely related. There are different distance metrics that can be used for similarity search, such as cosine distance or Euclidean distance.
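To make this concrete, here is a toy sketch of both distance metrics using NumPy. The three-dimensional vectors are made up purely for illustration; real embedding models produce vectors with hundreds of dimensions.

import numpy as np

# Toy 3-dimensional "embeddings" -- values invented for illustration only
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.85, 0.75, 0.15])
house = np.array([0.1, 0.2, 0.9])

def euclidean_distance(a, b):
  # Straight-line distance between the two vectors
  return np.linalg.norm(a - b)

def cosine_distance(a, b):
  # 1 minus the cosine of the angle between the vectors
  return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance(dog, cat))    # small: "dog" and "cat" are similar
print(euclidean_distance(dog, house))  # large: "dog" and "house" are not
print(cosine_distance(dog, cat))       # close to 0: nearly identical direction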

To turn our nutrition databank into vector embeddings, we'll use the sentence-transformers Python library. Sentence Transformers comes with many pre-trained models that excel at vectorizing text and run locally on our machine. Alternatively, you could use OpenAI's embeddings API, but it is much more costly and much slower for little marginal benefit.

To index our vector embeddings and perform similarity search over them, we'll use the FAISS library developed by Meta AI. FAISS comes with superb tooling for efficient vector index creation, and allows us to save our indices to disk.

You can do this project in a Jupyter Notebook or on a platform like Google Colab. Alternatively, you can create a new Python project in an IDE like PyCharm or VS Code. I recommend using a Jupyter Notebook to follow along with this article.

The code in this article is also available for you to follow along (and run yourself) on Google Colab.

Getting Started

Let’s first install the necessary dependencies for this project:

pip install sentence_transformers faiss-cpu numpy

We're installing faiss-cpu here, but if you are running this on a machine with a capable GPU, you could alternatively install faiss-gpu. We'll be using numpy since it interfaces well with sentence-transformers and FAISS.

Generate Vector Embeddings

Sentence Transformers offers a multitude of pre-trained models for vectorizing text, ranked in its documentation. Since our ultimate goal is to create a chatbot in which users ask questions and the bot responds with an answer, it makes the most sense to choose a "QA" model that excels at semantic search. Furthermore, the faster the better if we want to provide a quality user experience. I chose the multi-qa-MiniLM-L6-cos-v1 model since it is extremely fast, takes up little disk space, and performs well at semantic search.

Let's write a function generate_embedding that uses Sentence Transformers to encode text into a vector:

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the pre-trained SentenceTransformer model for generating sentence embeddings.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1", device="cpu")


def generate_embedding(text):
    response = model.encode([text])  # Encode the text using the pre-trained model
    return np.array(response[0])  # Return the generated embedding as a NumPy array

This function can be called with any piece of text, and it will use sentence-transformers to embed that text into a vector!

We're only encoding one piece of text at a time here by passing a single-element list to model.encode, but you can vectorize several texts in one call if you wish.
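As a quick sketch of what batching might look like (the sample texts are made up for illustration):

# Passing a list of texts to model.encode returns one embedding per text
texts = ["lean proteins", "whole grains", "leafy greens"]
batch_embeddings = model.encode(texts)
print(batch_embeddings.shape)  # (3, 384) for this model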

We'll also create a utility VectorStore class that will store all of our documents and their respective embeddings. Each document in documents will have an embedding in embeddings stored at the corresponding index, i.e. documents[0] will correspond to embeddings[0].

import numpy as np


class VectorStore:
  def __init__(self):
    self.documents = []
    self.embeddings = np.empty((0, 384), dtype=np.float32)  # Start empty; FAISS expects float32 vectors

  def add_to_store(self, document):
    # Append the document to the list of documents
    self.documents.append(document)

    # Generate the embedding for the document
    embedding = generate_embedding(document.content)

    # Concatenate the response with the existing embeddings vertically
    self.embeddings = np.vstack((self.embeddings, embedding))

add_to_store generates an embedding for the content of the provided document, and appends the new embedding to the existing embeddings (vertical stacking).

Note that we initialize self.embeddings to a NumPy array of shape (0, 384) because the multi-qa-MiniLM-L6-cos-v1 Sentence Transformers model produces 384-dimensional vectors.
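If you'd rather not hardcode the 384, Sentence Transformers can report the output dimension of the loaded model, as in this small sketch:

# Ask the model for its output dimensionality instead of hardcoding it
dimension = model.get_sentence_embedding_dimension()
print(dimension)  # 384 for multi-qa-MiniLM-L6-cos-v1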

Now, we can create our vector store by iterating through each of the documents in docs:

def generate_vector_store():
  store = VectorStore()

  for i in range(len(docs)):
    print(f"Processing {i}...")
    store.add_to_store(docs[i])

  return store

Once we run this function, we will have a vector store that contains all of our documents and their corresponding vector embeddings! Neat, right?
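Building the store and sanity-checking it might look like this (assuming docs holds the databank from part one):

store = generate_vector_store()

print(len(store.documents))    # number of documents in the store
print(store.embeddings.shape)  # (number of documents, 384)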

Saving our FAISS index

We need a way to save our vector store with the documents and vector embeddings to disk so we can import and load them on demand. To accomplish this, we'll use FAISS.

First, we have to create an index from our vector embeddings. FAISS provides many indexing algorithms, each with their own pros and cons. You can read more on these algorithms (flat indexes, locality sensitive hashing, hierarchical navigable small worlds, etc.) on Pinecone's incredible FAISS tutorial. For the sake of simplicity, we will just use IndexFlatL2, but I encourage you to experiment with other indexing algorithms!

import faiss


def create_index(embeddings):
  # Create an index with the same dimension as the embeddings
  index = faiss.IndexFlatL2(embeddings.shape[1])

  # Add the embeddings to the index
  index.add(embeddings)

  # Return the created index
  return index
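
If you do want to experiment beyond a flat index, here is a rough sketch of an IVF index, which clusters the vectors and must be trained before use. The nlist value is an illustrative guess, not a tuned parameter.

import faiss

def create_ivf_index(embeddings, nlist=10):
  # The quantizer assigns each vector to one of nlist clusters
  quantizer = faiss.IndexFlatL2(embeddings.shape[1])
  index = faiss.IndexIVFFlat(quantizer, embeddings.shape[1], nlist)

  # IVF indexes must be trained on representative vectors before adding
  index.train(embeddings)
  index.add(embeddings)

  return index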

Great, now we have our FAISS index ready to store to disk. To do so, FAISS provides a write_index function:

import faiss

faiss.write_index(create_index(store.embeddings), 'index.faiss')

And that's it! Pretty straightforward.
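When we want to use the index again later, faiss.read_index loads it back from disk:

import faiss

# Load the previously saved index back from disk
index = faiss.read_index('index.faiss')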

Experimenting with our Vector Index

Now that we have a FAISS vector index with our embeddings, we can play around with similarity search on our databank to see the sources it provides in response to a query.

To perform similarity search on a FAISS index, we can use index.search. This method accepts two arguments: the first is an array of query vectors to find similar vectors for, and the second is how many vectors to retrieve. The second argument is known as "k"; FAISS similarity search is a "k-selection algorithm", meaning it finds the "k" nearest neighbors to each provided vector.

Say we wanted to find 3 documents in our nutrition databank that include information on the healthiest types of meat. How would we go about this?

  1. We must first generate a vector embedding for our search query. Remember that similarity search compares vectors to vectors, not text to text!
  2. We call index.search on our FAISS index. This gives us the indices of the closest document embeddings in the index, as well as their distances from the search query.
  3. We find the document corresponding to that index.

import numpy as np

# Generate embedding for the given query
query_embedding = generate_embedding("healthiest types of meat")

# Search for similar embeddings in the index
distances, results = index.search(np.array([query_embedding]), k=3)

# Print the content of the documents
for i in results[0]:
  print(docs[i].content)

Sure enough, this prints out 3 documents that contain relevant information to what the "healthiest types of meat" are!

Similarity Threshold

What happens when we use a query that doesn't have much relevance to what's in our databank, like "best building materials"? The documents aren't super helpful—they contain information on things that are tangentially related like "urban agriculture" but nothing that can help us learn more about building materials.

This raises the question: how can we algorithmically determine whether a document is truly "similar" to the search query?

Well, we can use the distances provided by the FAISS search. If you take a look at the distances for documents similar to "healthiest types of meat", they are all much smaller than the distances for "best building materials". Through trial and error, we can establish a similarity threshold of 1 for this scenario.
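A quick way to eyeball a threshold is to print the raw distances for an in-domain query next to an out-of-domain one, as in this sketch (reusing the index and generate_embedding from earlier):

# Compare raw L2 distances for a relevant query and an irrelevant one
for query in ["healthiest types of meat", "best building materials"]:
  embedding = generate_embedding(query)
  distances, _ = index.search(np.array([embedding]), k=3)
  print(query, distances[0])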

Note that similarity thresholds will likely differ based on numerous factors, such as the quality of your databank, which similarity search and/or indexing algorithms you use, and which vector embedding model you use.

If we filter our results to make sure their distances are less than our similarity threshold, we will only get truly similar documents that are relevant to our query.

# Import required libraries
import numpy as np

# Set the similarity threshold
similarity_threshold = 1

# Generate embedding for the given query
query_embedding = generate_embedding("healthiest types of meat")

# Search for similar embeddings in the index
distances, results = index.search(np.array([query_embedding]), k=3)

# Filter the results based on the similarity threshold
filtered_results = []
for i, distance in zip(results[0], distances[0]):
  if distance <= similarity_threshold:
    filtered_results.append(i)

# Print the content of the documents
for i in filtered_results:
  print(docs[i].content)

Sometimes, there may not be any relevant content. In this case, you can say the query is outside the scope of your databank—depending on the quality and size of your databank, you may also determine that the query is not relevant to your similarity search application.
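In code, that fallback can be as simple as checking whether any results survived the filter; a minimal sketch building on the snippet above:

# If no documents pass the threshold, treat the query as out of scope
if not filtered_results:
  print("This query appears to be outside the scope of our databank.")
else:
  for i in filtered_results:
    print(docs[i].content)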

Conclusion

Vector embeddings and similarity search are incredibly powerful tools for finding content relevant to a search query. By transforming textual data into vector embeddings, we are able to represent the meaning of text snippets with numbers, allowing us to perform mathematical operations on them. Similarity search then enables us to find documents related to our search query.

The process of generating vector embeddings involves using the sentence-transformers library, which offers pre-trained models that excel at vectorizing text. We can then use the FAISS library for efficient vector index creation and perform similarity search over the indexed embeddings.

Through experimentation with our vector index, we can see how similarity search can retrieve relevant documents based on a query. By establishing a similarity threshold, we can filter the results to include only truly similar and relevant documents.

In the next article in this series, we will use similarity search on our databank to provide contextual information to ChatGPT, thereby enhancing its ability to answer domain-specific questions with reliable information and citations.
