Seenivasa Ramadurai

Exploring Text Similarity with OpenAIEmbeddings and Dot Products in Python

Embeddings play a crucial role in modern Natural Language Processing (NLP) by representing words or sentences as vectors in a high-dimensional space. The idea is simple yet powerful: similar texts should have similar embeddings, and the similarity can be measured using operations like dot products. In this blog post, we’ll explore how to compute text similarity using OpenAIEmbeddings from the LangChain framework and calculate the dot product to compare embeddings.

Why Use Embeddings and Dot Products?

Embeddings provide a powerful way to capture the semantic meaning of text, which is essential for a wide range of NLP tasks, such as text classification, question answering, and even search engines.

The dot product is one of the simplest methods for comparing these embeddings, providing a quick way to measure how similar two pieces of text are. Because OpenAI embeddings are normalized to unit length, the dot product of two embedding vectors is equal to their cosine similarity.
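
To see why, here is a small numpy-only sketch with two hand-made vectors (the values are arbitrary, chosen just for illustration): once the vectors are normalized to length 1, a plain dot product and the cosine similarity coincide.

import numpy as np

# Two arbitrary 3-dimensional vectors, standing in for embeddings
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

# Cosine similarity: dot product divided by the product of the norms
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize both vectors to unit length, then take a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(cosine, dot_of_units)  # both print the same value (~0.9926)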

This technique is particularly useful in applications like:

Recommendation Systems: Recommending content based on text similarity.
Search Engines: Retrieving documents similar to a query.
Chatbots and Virtual Assistants: Understanding user queries by comparing them to known phrases.

Let’s dive into the code and see how this works.

To begin, you’ll need to install the necessary packages: langchain, langchain-openai, numpy, and python-dotenv (to manage environment variables).

pip install langchain langchain-openai numpy python-dotenv

We’re using python-dotenv to load environment variables (such as API keys). Ensure you have your .env file ready with your OpenAI API key, which is necessary for generating embeddings.
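
Before calling the API, you can quickly confirm that the key is being picked up. This is a minimal sketch; it assumes your .env file defines OPENAI_API_KEY, the environment variable OpenAIEmbeddings reads by default:

import os
from dotenv import load_dotenv, find_dotenv

# find_dotenv() locates the nearest .env file; load_dotenv() loads its variables
load_dotenv(find_dotenv())

# OpenAIEmbeddings reads the key from this variable unless one is passed explicitly
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from the environment"

With the key loaded, we can generate and compare the embeddings: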

from langchain_openai import OpenAIEmbeddings
import numpy as np
from dotenv import load_dotenv, find_dotenv

# Load the OpenAI API key from the nearest .env file
load_dotenv(find_dotenv())

embeddings = OpenAIEmbeddings()

line1 = "The cat is sleeping on the sofa."
line2 = "The cat is resting on the couch."
line3 = "Sreeni"

# Generate an embedding vector for each line using embed_query()
line1_embed = embeddings.embed_query(line1)
line2_embed = embeddings.embed_query(line2)
line3_embed = embeddings.embed_query(line3)

# Compare the embeddings with dot products
print(f"The similarity between {line1} and {line2} is {np.dot(line1_embed, line2_embed)}")
print(f"The similarity between {line1} and {line3} is {np.dot(line1_embed, line3_embed)}")

The similarity between The cat is sleeping on the sofa. and The cat is resting on the couch. is 0.9670626687198663
The similarity between The cat is sleeping on the sofa. and Sreeni is 0.7573476651283388

The first dot product compares the embeddings of "line1" and "line2", which yields a high similarity score, as the two sentences are semantically related paraphrases of each other.

The second dot product compares the embeddings of "line1" and "line3"; because a name like "Sreeni" has little semantic overlap with a sentence about a sleeping cat, it yields a much lower score, indicating less similarity.
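
To connect this back to the search-engine use case from earlier, here’s a minimal sketch that ranks a handful of documents against a query by dot product. The three-document corpus and the query are made up for illustration; the embedding setup is the same as above:

from langchain_openai import OpenAIEmbeddings
import numpy as np
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
embeddings = OpenAIEmbeddings()

# A made-up mini corpus to search over
docs = [
    "How to train a neural network",
    "Best sofas and couches for cat owners",
    "A beginner's guide to Python decorators",
]
query = "Where does my cat like to sleep?"

# embed_documents() embeds a batch of texts; embed_query() embeds the query
doc_vectors = embeddings.embed_documents(docs)
query_vector = embeddings.embed_query(query)

# Score each document by its dot product with the query, highest first
scores = [np.dot(query_vector, vec) for vec in doc_vectors]
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.4f}  {doc}")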

In this blog, we explored how to use OpenAIEmbeddings from LangChain to generate embeddings for text and compare them using dot products. This approach is a simple yet effective method for measuring text similarity, which can be applied to a wide range of NLP tasks. With just a few lines of code, you can start leveraging the power of embeddings in your applications.

Thanks
Sreeni Ramadurai
