Word Embeddings

What are word embeddings?

Word embeddings are a type of word representation used in natural language processing (NLP) and machine learning. They involve mapping words or phrases to vectors of real numbers in a continuous vector space. The idea is that words with similar meanings will have similar embeddings, making it easier for algorithms to understand and process language.

Here’s a bit more detail on how it works:

  1. Vector Representation: Each word is represented as a vector (a list of numbers). For example, the word "north" is represented as [0.30059, 0.55598, -0.040589, ...].
  2. Semantic Similarity: Words that have similar meanings are mapped to nearby points in the vector space. So, "north" and "south" are close to each other, while "north" and "season" are further apart (a short similarity sketch follows this list).
  3. Dimensionality: The vectors are usually of high dimensionality (e.g., 50 to 300 dimensions). Higher dimensions can capture more subtle semantic relationships, but also require more data and computational resources.
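
To make points 1 and 2 concrete, here is a minimal sketch of how the similarity between two embeddings is typically measured with cosine similarity. The vectors below are made-up, low-dimensional toy values, not real GloVe numbers:

    import numpy as np
    
    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: values near 1 mean a similar direction
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    # Toy 3-dimensional vectors (illustrative values only;
    # real embeddings such as GloVe use 50-300 dimensions)
    north = np.array([0.30, 0.55, -0.04])
    south = np.array([0.28, 0.50, -0.10])
    season = np.array([-0.60, 0.10, 0.90])
    
    print(cosine_similarity(north, south))   # high: the words point in a similar direction
    print(cosine_similarity(north, season))  # lower: the words are less related

The exact numbers are irrelevant here; the point is that the same comparison works unchanged on real 50- or 300-dimensional vectors.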

Pre-trained word embeddings

Pre-trained word embeddings are vectors that represent words in a continuous vector space, where semantically similar words are mapped to nearby points. They're generated by training on large text corpora, capturing syntactic and semantic relationships between words. These embeddings are useful in natural language processing (NLP) because they provide a dense and informative representation of words, which can improve the performance of various NLP tasks. A short sketch of loading such embeddings follows the list of examples below.

What are some examples of pre-trained word embeddings?

  1. Word2Vec: Developed by Google, it represents words in a vector space by training on large text corpora using either the Continuous Bag of Words (CBOW) or Skip-Gram model.
  2. GloVe (Global Vectors for Word Representation): Developed by Stanford, it factors word co-occurrence matrices into lower-dimensional vectors, capturing global statistical information.
  3. FastText: Developed by Facebook, it builds on Word2Vec by representing words as bags of character n-grams, which helps handle out-of-vocabulary words better.
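
All three are distributed as ready-to-use models. As a quick sketch (assuming the gensim package is installed; the model names are the ones published in the gensim-data catalogue, and api.info() lists the alternatives), they can be loaded through the gensim downloader:

    import gensim.downloader as api
    
    # Download (on first use) and load 50-dimensional GloVe vectors trained on
    # Wikipedia + Gigaword; other catalogue entries include
    # "word2vec-google-news-300" and "fasttext-wiki-news-subwords-300".
    glove = api.load("glove-wiki-gigaword-50")
    
    print(glove["north"][:5])                   # first 5 components of the vector for "north"
    print(glove.most_similar("north", topn=3))  # nearest neighbours in the embedding space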

Visualizing pre-trained word embeddings can help you understand the relationships and structure of words in the embedding space.

  1. Import libraries.
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.manifold import TSNE
    
    
  2. Download the pre-trained GloVe word vectors (the glove.6B set from the Stanford NLP GloVe project) and load them.
    
    # Load pre-trained GloVe embeddings
    embeddings_index = {}
    try:
        with open('glove.6B.50d.txt', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
    except OSError as e:
        raise OSError("Could not open the GloVe embeddings file.") from e
    
    # Plot a manageable subset: the first 200 words in the file
    words = list(embeddings_index.keys())[:200]
    words_vectors = np.array([embeddings_index[word] for word in words])
    
    # Reduce the 50-dimensional vectors to 2-D with t-SNE
    # (note: this parameter is named n_iter in scikit-learn versions before 1.5)
    tsne = TSNE(n_components=2, random_state=0, perplexity=30, max_iter=300)
    word_vectors_2d = tsne.fit_transform(words_vectors)
    
    
  3. Visualize the word embeddings.
    
    # Convert numpy array to dataframe
    df = pd.DataFrame({
        'x': word_vectors_2d[:, 0],
        'y': word_vectors_2d[:, 1],
        'word': words
    })
    
    # Set Seaborn style
    sns.set(style='whitegrid')
    
    # Create the plot
    plt.figure(figsize=(5, 1.9), dpi=1000)
    plt.grid(True)
    ax = sns.scatterplot(data=df, x='x', y='y', hue='word', palette='tab20', legend=None, s=25, alpha=0.7)
    
    # Customize plot appearance
    ax.set(xlabel=None)
    ax.set(ylabel=None)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_xlim(-15, 15)  # Adjust these values as needed
    ax.set_ylim(-15, 15)
    
    # Hide the x and y ticks by setting them to empty lists
    plt.xticks([])
    plt.yticks([])
    
    # Save and show the plot
    plt.savefig('Word Embeddings 2D.png', bbox_inches='tight', transparent=True, dpi=1000)
    plt.show()
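
The scatter plot colours each point by word but hides the legend, so individual words are not readable from the figure itself. An optional tweak (a sketch, not part of the original script) is to label every point; these lines would go just before the plt.savefig(...) call above:

    # Label each point with its word so clusters can be read directly off the plot
    for x, y, word in zip(df['x'], df['y'], df['word']):
        ax.annotate(word, (x, y), fontsize=4, alpha=0.8)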
    
    

Conclusion

  1. Clustering of Words: Words that are semantically similar (according to their embeddings) tend to appear closer together in the plot. For instance, words like "king" and "queen" might form one cluster, while words like "dog" and "cat" cluster in a different region.
  2. Dimensionality Reduction: The word embeddings, which are high-dimensional vectors (50-dimensional here; Word2Vec and GloVe also come in up to 300 dimensions), have been reduced to 2D. While the original high-dimensional relationships are complex, t-SNE tries to preserve the local structure in the reduced 2D space, so groups of words that were close in high dimensions tend to appear near each other in this plot.
  3. Interpretation: Each point on the plot represents a word, and the distance between points reflects the similarity between the words in their original high-dimensional vector space. Clusters of words likely share semantic similarities, while isolated points might represent words that are more unique or less contextually related to others in the data set.
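
Because t-SNE only approximates the original geometry, apparent clusters are worth sanity-checking back in the full 50-dimensional space. Here is a minimal sketch that reuses the embeddings_index dictionary loaded in step 2 (the word pairs are arbitrary examples; any words in the GloVe vocabulary work):

    # Cosine similarity in the original 50-dimensional GloVe space
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    # Pairs expected to be related should score noticeably higher than unrelated ones
    pairs = [("king", "queen"), ("dog", "cat"), ("king", "cat")]
    for w1, w2 in pairs:
        print(w1, w2, round(cosine(embeddings_index[w1], embeddings_index[w2]), 3))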
