Introduction to Vector Search and Embeddings


In the age of information, extracting relevant results from massive data sets is crucial. To understand and interpret text, modern natural language processing (NLP) tools employ a technique called vectorization, in which textual information is converted into a numerical format. In this blog post, we'll explore how vector search and embeddings are used to find similar or relevant texts, and we'll walk through a real-world code example.

What Are Embeddings?

Embeddings are the core concept behind converting text into numerical vectors. An embedding is essentially a multi-dimensional numerical representation of a word or a group of words. These representations capture the meaning and semantic relationships between words, allowing computers to "understand" text. In a good embedding space, for example, the vectors for "soccer" and "football" land close together, while the vectors for "soccer" and "spreadsheet" end up far apart.

What Is Vector Search?

Vector search refers to finding the closest vectors in a given space that are relevant or similar to a particular query vector. It's like searching for items in a database, but instead of matching text, you're matching mathematical vectors. This search is often conducted using a measure called cosine similarity, which computes the cosine of the angle between two vectors.
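
To make cosine similarity concrete, here's a minimal sketch using toy 3-dimensional vectors (real sentence embeddings have hundreds of dimensions):

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between vectors a and b: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([0.2, 0.8, 0.1])
v2 = np.array([0.25, 0.75, 0.05])  # points in nearly the same direction as v1
v3 = np.array([0.9, 0.1, 0.4])     # points in a noticeably different direction

print(cosine_sim(v1, v2))  # close to 1.0 -> very similar
print(cosine_sim(v1, v3))  # smaller -> less similar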

Code Example: Finding Relevant Texts with Sentence Embeddings

Let's delve into a Python code example to see these concepts in action. We'll use the Sentence Transformers library to create embeddings and the sklearn library to calculate cosine similarity. We'll also use numpy to sort the results of the cosine_similarity function, and the openai Python package to make calls to OpenAI's API.

import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import openai
from dotenv import load_dotenv

Step 1: Setting up the Environment

First, we need to install these packages:

pip install sentence-transformers scikit-learn numpy openai python-dotenv

Next, we'll load our .env file to initialize the OpenAI API key and suppress a parallelism warning arising from the sentence_transformers package (not a concern for our single-threaded example script). If you need an OpenAI API key, you can create one from your OpenAI account dashboard. For this blog post we are using the Python package python-dotenv to load a local .env file for simplicity.

Assuming your .env file, located right next to your Python script, looks like this:

OPENAI_API_KEY=YOUR_OPENAI_API_KEY

Then we can load our OpenAI API key:

# loads our ".env", assumes it is in the same directory as this Python script
load_dotenv()
# This script is single-threaded,
# so it's safe to suppress the TOKENIZERS_PARALLELISM warning
os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Load our OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')
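
As a quick, optional sanity check, you can fail fast if the key didn't load:

# Optional: stop early if the .env file wasn't found or the key is missing
assert openai.api_key, "OPENAI_API_KEY not set -- check your .env file"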

Step 2: Creating Text Embeddings

We use the Sentence Transformer model bert-base-nli-mean-tokens to convert example texts into numerical vectors (embeddings).

# Initialize our model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Example texts for our queries later
texts = [
    "John loves playing basketball on the weekends.",
    "Emily is a huge fan of soccer and never misses a game.",
    "Mike enjoys going golfing with his friends.",
    "Sarah's favorite sport is tennis, and she plays every Thursday.",
    "Tom and his friends are passionate about baseball and often watch games together."
]

# Create our embeddings. For this blog post, we are just loading in memory.
# I'll explain better approaches later in the post.
text_vectors = model.encode(texts)
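
If you're curious what model.encode returns, a quick check shows one vector per input text (bert-base-nli-mean-tokens produces 768-dimensional embeddings):

print(type(text_vectors))  # <class 'numpy.ndarray'>
print(text_vectors.shape)  # (5, 768): one 768-dimensional vector per text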

Step 3: Defining the Search Function

We define a function get_relevant_texts to take a query string and find the relevant texts from our corpus.

# perform the vector search to find the relevant texts
def get_relevant_texts(query: str):
    query_vector = model.encode([query])
    similarities = cosine_similarity(query_vector, text_vectors)
    indices = np.argsort(similarities[0])[::-1]
    return indices[:2]

This function calculates the cosine similarity between the query vector and the text vectors and returns the indices of the two most relevant texts.
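
For example, calling the function directly returns the indices of the top two matches (the exact values depend on the model, but for this query the basketball text should rank first):

print(get_relevant_texts("Who plays basketball?"))  # e.g. [0 4]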

Step 4: Using OpenAI's gpt-3.5-turbo for Contextual Responses

We take the retrieved texts and construct a chat prompt to send to OpenAI's gpt-3.5-turbo model, allowing it to respond with context-aware answers.

# Encapsulate the call to the OpenAI gpt-3.5-turbo model
def get_response_with_context(text: str):
    relevant_texts_indices = get_relevant_texts(text)
    relevant_texts = [texts[i] for i in relevant_texts_indices]
    # Prefix each retrieved text so the model can tell the context apart from the question
    content = ' '.join(f"text: {t}" for t in relevant_texts) + ' user: ' + text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a QA bot given texts to answer questions."},
            {"role": "user", "content": content}],
        # Temperature set to 0 to reduce the randomness of the response.
        # Better for applications that expect consistent responses.
        temperature=0,
        max_tokens=512)
    return response.choices[0].message.content
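
With this construction, the message sent to the model reads something like text: <first retrieved text> text: <second retrieved text> user: <your question>, so the model answers using only the retrieved context.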

Step 5: Testing the Code

We can then test our code with specific queries to see the relevant responses based on the context provided.

test_query = "What does John do on the weekend?"
answer = get_response_with_context(text=test_query)

# Should respond with something like:
# "John plays basketball on the weekends."
print(answer)

# Another example
test_query = "Who likes tennis?"

answer = get_response_with_context(
    text=test_query)

# Should respond with something like:
# "Sarah likes tennis."
print(answer)

Next Steps

While this blog post has only covered a basic introduction to using embeddings, here are some next steps to consider:

  1. Vector Databases: As your data grows, efficiently searching through millions of vectors can become a challenge. Specialized vector databases like FAISS, Annoy, or Elasticsearch's vector search capabilities can be explored to manage and search through large-scale vector data (see the FAISS sketch after this list). In addition, databases like SQLite and PostgreSQL have extensions, sqlite-vss and pgvector respectively, that can be used to store and query vector embeddings.

  2. More Robust Ways of Creating Embeddings: While the example provided in this blog post offers a straightforward approach to creating embeddings, various other methods and models can be explored. Pre-trained models like BERT, RoBERTa, and DistilBERT offer different characteristics and performance. Investigate these alternatives to find the optimal approach for your specific task, or use the OpenAI and Cohere APIs to create vector embeddings (a short OpenAI sketch follows below).
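
To make item 1 concrete, here's a minimal FAISS sketch. It assumes pip install faiss-cpu and reuses model and text_vectors from earlier; treat it as a sketch, not production code:

import faiss
import numpy as np

# text_vectors from earlier has shape (num_texts, 768)
dimension = text_vectors.shape[1]

# IndexFlatIP performs exact inner-product search; normalizing the vectors
# first makes inner product equivalent to cosine similarity
index = faiss.IndexFlatIP(dimension)
normalized = text_vectors / np.linalg.norm(text_vectors, axis=1, keepdims=True)
index.add(normalized.astype('float32'))

query = model.encode(["Who plays basketball?"])
query = query / np.linalg.norm(query, axis=1, keepdims=True)
scores, indices = index.search(query.astype('float32'), 2)
print(indices[0])  # indices of the two most similar texts

And for item 2, a minimal sketch of creating an embedding with the OpenAI API, using the same pre-1.0 openai package style as the rest of this post:

# text-embedding-ada-002 returns 1536-dimensional embeddings
response = openai.Embedding.create(
    input=["John loves playing basketball on the weekends."],
    model="text-embedding-ada-002")
embedding = response['data'][0]['embedding']  # a list of 1536 floats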

Conclusion

Vector search and embeddings are powerful tools in modern NLP. They allow us to represent text in a format that machines can interpret, enabling us to find relevant information, derive insights, and work around the limited context windows of large language models. The code example demonstrates how these concepts can be applied using popular libraries to search within a set of texts and obtain contextual answers. Whether it's for a chatbot or a recommendation engine, these techniques can play a vital role in various applications.

Thanks for reading this far! Here's the full example code used in this blog post:

import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import openai
from dotenv import load_dotenv

# loads our ".env", assumes it is in the same directory as this Python script
load_dotenv()
# This script is single-threaded,
# so it's safe to suppress the TOKENIZERS_PARALLELISM warning
os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Initialize our model
model = SentenceTransformer('bert-base-nli-mean-tokens')

texts = [
    "John loves playing basketball on the weekends.",
    "Emily is a huge fan of soccer and never misses a game.",
    "Mike enjoys going golfing with his friends.",
    "Sarah's favorite sport is tennis, and she plays every Thursday.",
    "Tom and his friends are passionate about baseball and often watch games together."
]

# Create our embeddings. For this blog post, we are just loading in memory.
# I'll explain better approaches later in the post.
text_vectors = model.encode(texts)

# Load our OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

# perform the vector search to find the relevant texts
def get_relevant_texts(query: str):
    query_vector = model.encode([query])
    similarities = cosine_similarity(query_vector, text_vectors)
    indices = np.argsort(similarities[0])[::-1]
    return indices[:2]

# Encapsulate the call to the OpenAI gpt-3.5-turbo model
def get_response_with_context(text: str):
    relevant_texts_indices = get_relevant_texts(text)
    relevant_texts = [texts[i] for i in relevant_texts_indices]
    # Prefix each retrieved text so the model can tell the context apart from the question
    content = ' '.join(f"text: {t}" for t in relevant_texts) + ' user: ' + text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a QA bot given texts to answer questions."},
            {"role": "user", "content": content}],
        # Temperature set to 0 to reduce the randomness of the response.
        # Better for applications that expect consistent responses
        temperature=0,
        max_tokens=512)
    return response.choices[0].message.content

test_query = "What does John do on the weekend?"
answer = get_response_with_context(text=test_query)

# Should respond with something like:
# "John plays basketball on the weekends."
print(answer)

# Another example
test_query = "Who enjoys tennis?"

answer = get_response_with_context(
    text=test_query)

# Should respond with something like:
# "Sarah enjoys tennis."
print(answer)
