Ravi

Exploring Word Embeddings: Python Implementation of Word2Vec and GloVe in Vector Databases

Word embeddings like Word2Vec and GloVe are powerful techniques to convert words into continuous vector representations. These vectors capture semantic relationships between words, making them useful for various applications, including vector databases.
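To make "semantic relationship" concrete: the similarity between two embedding vectors is usually measured with cosine similarity (or a distance such as Euclidean, as FAISS does later in this post). The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings, like the 100-dimensional ones trained below, behave the same way.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: close to 1 means similar direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings" (illustrative values only; real vectors have 100+ dimensions)
king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.2])
apple = np.array([0.1, 0.9, 0.8])

print(cosine_similarity(king, queen))  # high (~0.98): related words point in similar directions
print(cosine_similarity(king, apple))  # lower (~0.41): unrelated words point in different directions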

Example of Using Word Embeddings with Python

We'll cover how to generate word embeddings using Word2Vec and GloVe, and then store these embeddings in a vector database (like FAISS or Annoy) for efficient similarity searches.

Step 1: Install Required Libraries

First, make sure you have the required libraries installed. You can install them via pip:

pip install gensim faiss-cpu

Step 2: Generate Word Embeddings

Using Word2Vec

Here's how to generate word embeddings using Word2Vec:

import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK tokenizer resources (newer NLTK releases also need 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

# Sample text data
sentences = [
    "Natural language processing is a fascinating field.",
    "Word embeddings are useful for semantic search.",
    "Gensim is a popular library for topic modeling and embeddings.",
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save the model
word2vec_model.save("word2vec.model")

Let's break down the provided code step by step to understand its purpose and functionality:

  • Importing Libraries:

    • gensim is a library for topic modeling and document similarity analysis.
    • Word2Vec is a specific model within Gensim for creating word embeddings.
    • word_tokenize from the NLTK (Natural Language Toolkit) library is used for breaking sentences into individual words (tokens).
    • nltk is the library that provides various tools for natural language processing.
  • Downloading NLTK Resources: These lines download the tokenizer resources from NLTK that are needed for the word_tokenize function to work.

  • Sample Text Data: Here, a list of sentences is defined to serve as the training data for the Word2Vec model. These sentences touch on different aspects of natural language processing and the Gensim library.

  • Tokenizing Sentences: This line processes each sentence in the sentences list:

    • It converts the sentence to lowercase to ensure uniformity.
    • word_tokenize breaks the sentence into individual words, resulting in a list of tokenized sentences.
  • Training the Word2Vec Model: This line creates and trains a Word2Vec model using the tokenized sentences.

    • vector_size=100: Sets the dimensionality of the word vectors to 100.
    • window=5: Defines the context window size, meaning the model will consider 5 words before and after a target word to learn its context.
    • min_count=1: Ensures that words appearing at least once are included in the model. (In practice, a higher value is often used to filter out rare words.)
    • workers=4: Specifies the number of CPU threads to use during training, allowing for faster processing.
  • Saving the Model: This line saves the trained Word2Vec model to a file named "word2vec.model", allowing you to load and use it later without retraining.
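
Once saved, the model can be reloaded and queried directly with Gensim's built-in similarity search; here's a minimal sketch (assuming the "word2vec.model" file created above):

from gensim.models import Word2Vec

# Reload the model saved above
loaded_model = Word2Vec.load("word2vec.model")

# Look up the 100-dimensional vector for a word in the vocabulary
vector = loaded_model.wv["language"]
print(vector.shape)  # (100,)

# Cosine-similarity search over the vocabulary
print(loaded_model.wv.most_similar("language", topn=3))

Note that with only three training sentences the neighbours will be essentially random; meaningful results require a much larger corpus.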

Using GloVe

To use GloVe, you'll need to install the glove-python-binary package:

pip install glove-python-binary

Here's how to generate GloVe embeddings:

from glove import Corpus, Glove

# Create a corpus from the tokenized sentences
corpus = Corpus()
corpus.fit(tokenized_sentences, window=5)

# Train GloVe model on the co-occurrence matrix
glove_model = Glove(no_components=100, learning_rate=0.05)
glove_model.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)

# Attach the corpus dictionary so words can be looked up by string later
glove_model.add_dictionary(corpus.dictionary)

# Save the model
glove_model.save("glove.model")
  • Importing Libraries: This line imports the Corpus and Glove classes from the glove library, which is used for generating GloVe (Global Vectors for Word Representation) embeddings.

  • Creating a Corpus: This line creates an instance of the Corpus class. A corpus is a collection of text that will be used to train the GloVe model.

  • Fitting the Corpus: This line builds the word co-occurrence statistics from tokenized_sentences, the list of tokenized words from your text data. The window parameter specifies the size of the context window (the number of words to consider before and after a target word); a larger window takes more context into account.

  • Creating a GloVe Model: This line creates an instance of the Glove class. The no_components parameter specifies the dimensionality of the word vectors (in this case, 100 dimensions), and learning_rate sets the initial learning rate for the model training.

  • Training the GloVe Model: This line fits the GloVe model to the matrix created from the corpus.

    • corpus.matrix provides the co-occurrence matrix of words, which is used to train the embeddings.
    • epochs specifies the number of training iterations (30 in this case).
    • no_threads indicates the number of CPU threads to use for training (4 threads).
    • verbose=True means that the training process will output progress messages.
  • Saving the Model: This line saves the trained GloVe model to a file named "glove.model". This allows you to load the model later for generating embeddings or performing other tasks without retraining.
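
If you want to query the saved GloVe model later, a minimal sketch looks like this (it assumes the glove-python-binary API shown above, and that add_dictionary was called during training so words can be looked up by string):

from glove import Glove

# Reload the model saved above
loaded_glove = Glove.load("glove.model")

# Retrieve the 100-dimensional vector for a word via the attached dictionary
word_idx = loaded_glove.dictionary["language"]
vector = loaded_glove.word_vectors[word_idx]
print(vector.shape)  # (100,)

# Nearest neighbours by cosine similarity
print(loaded_glove.most_similar("language", number=3))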

Step 3: Store and Query Word Embeddings in a Vector Database

For this example, we will use FAISS to create a simple vector database and perform similarity searches.

Using FAISS

import numpy as np
import faiss

# Get word vectors from the Word2Vec model
word_vectors = word2vec_model.wv
word_list = list(word_vectors.index_to_key)
word_embeddings = np.array([word_vectors[word] for word in word_list]).astype('float32')

# Create FAISS index
index = faiss.IndexFlatL2(word_embeddings.shape[1])  # L2 distance
index.add(word_embeddings)

# Function to find the top n similar words
def find_similar_words(word, n=3):
    if word in word_vectors:
        word_vector = word_vectors[word].reshape(1, -1).astype('float32')
        distances, indices = index.search(word_vector, n)
        return [(word_list[i], distances[0][j]) for j, i in enumerate(indices[0])]
    else:
        return []

# Example query
similar_words = find_similar_words('language')
print("Similar words to 'language':", similar_words)

Let's break down the provided code step by step to understand its purpose and functionality:

  • Importing Libraries: This line imports NumPy (for numerical operations) and FAISS (Facebook AI Similarity Search), a library optimized for efficient similarity search and clustering of dense vectors.

  • Accessing Word Vectors: This code retrieves the word vectors from a previously trained Word2Vec model.

    • word_vectors contains the actual embeddings for each word.
    • word_list creates a list of words (the vocabulary) based on their indices.
  • Creating a NumPy Array of Embeddings: This line constructs a NumPy array (word_embeddings) containing the word vectors for all the words in the vocabulary. The vectors are converted to the float32 data type for compatibility with FAISS.

  • Creating a FAISS Index: This line initializes a FAISS index for performing similarity searches.

    • IndexFlatL2 creates a flat (non-hierarchical) index that uses L2 distance (Euclidean distance) to measure similarity between vectors.
    • word_embeddings.shape[1] specifies the dimensionality of the vectors.
  • Adding Embeddings to the Index: This line adds all the word embeddings to the FAISS index, allowing for efficient similarity search operations.

  • Defining a Similarity Search Function: This function, find_similar_words, takes a word and the number of similar words to return (n).

    • It first checks if the word is in the word vectors.
    • If the word exists, it retrieves its corresponding vector, reshapes it to a 2D array, and converts it to float32.
    • The index.search method is used to find the n most similar words based on L2 distance, returning both the distances and indices of the closest words.
    • The function then constructs a list of tuples containing the similar words and their distances.
  • Executing a Query: This code calls the find_similar_words function with the word "language" and prints out the similar words along with their distances.
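
Two optional follow-ups, sketched under the assumption that index, word_list, word_vectors, and word_embeddings from the code above are still in memory: persisting the FAISS index to disk, and switching to cosine similarity by running inner-product search on L2-normalized vectors.

import faiss

# Persist the index and reload it later (word_list must be saved separately,
# e.g. with pickle, since FAISS stores only the vectors)
faiss.write_index(index, "word_embeddings.index")
index = faiss.read_index("word_embeddings.index")

# Cosine similarity via inner product on L2-normalized vectors
normalized = word_embeddings.copy()
faiss.normalize_L2(normalized)                       # normalizes rows in place
cosine_index = faiss.IndexFlatIP(normalized.shape[1])
cosine_index.add(normalized)

query = word_vectors['language'].reshape(1, -1).astype('float32')
faiss.normalize_L2(query)
scores, indices = cosine_index.search(query, 3)      # higher score = more similar
print([(word_list[i], scores[0][j]) for j, i in enumerate(indices[0])])

Note that the query word itself is stored in the index, so in both variants it will typically come back as the closest match (distance 0, or cosine score 1.0).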

To conclude, this code demonstrates how to:

  1. Generate word embeddings using Word2Vec and GloVe.
  2. Store these embeddings in a FAISS vector database.
  3. Perform similarity searches to find words that are semantically similar.

You can adjust the sample text and query words to see how the embeddings capture different relationships.
