Wildan Aziz

Word Embeddings

What are word embeddings?

Word embeddings are a type of word representation used in natural language processing (NLP) and machine learning. They involve mapping words or phrases to vectors of real numbers in a continuous vector space. The idea is that words with similar meanings will have similar embeddings, making it easier for algorithms to understand and process language.

Here’s a bit more detail on how it works:

  1. Vector Representation: Each word is represented as a vector (a list of numbers). For example, the word "north" is represented as [0.30059, 0.55598, -0.040589, ...].
  2. Semantic Similarity: Words that have similar meanings are mapped to nearby points in the vector space. So, "north" and "south" are close to each other, while "north" and "season" are further apart (see the short similarity sketch after this list).
  3. Dimensionality: The vectors are usually of high dimensionality (e.g., 50 to 300 dimensions). Higher dimensions can capture more subtle semantic relationships, but also require more data and computational resources.
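
To make the idea of "nearby points" concrete, here is a minimal cosine-similarity sketch. The 4-dimensional vectors below are toy values invented for illustration (not real GloVe numbers); real embeddings behave the same way, just with 50 to 300 dimensions.

    import numpy as np
    
    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: values near 1 mean very similar direction
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    # Toy 4-dimensional vectors, made up for illustration only (not real GloVe values)
    toy_embeddings = {
        'north':  np.array([0.30, 0.55, -0.04, 0.10]),
        'south':  np.array([0.28, 0.50, -0.01, 0.12]),
        'season': np.array([-0.40, 0.10, 0.70, -0.20]),
    }
    
    print(cosine_similarity(toy_embeddings['north'], toy_embeddings['south']))   # high: related words
    print(cosine_similarity(toy_embeddings['north'], toy_embeddings['season']))  # much lower: unrelated words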

Pre-trained word embeddings

Pre-trained word embeddings are vectors that represent words in a continuous vector space, where semantically similar words are mapped to nearby points. They’re generated by training on large text corpora, capturing syntactic and semantic relationships between words. These embeddings are useful in natural language processing (NLP) because they provide a dense and informative representation of words, which can improve the performance of various NLP tasks.

What are some examples of pre-trained word embeddings?

  1. Word2Vec: Developed by Google, it represents words in a vector space by training on large text corpora using either the Continuous Bag of Words (CBOW) or Skip-Gram model.
  2. GloVe (Global Vectors for Word Representation): Developed by Stanford, it factorizes word co-occurrence matrices into lower-dimensional vectors, capturing global statistical information (a minimal loading sketch with gensim follows this list).
  3. FastText: Developed by Facebook, it builds on Word2Vec by representing words as bags of character n-grams, which helps handle out-of-vocabulary words better.
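
As a quick way to try one of these models without handling raw files, here is a minimal sketch that loads the 50-dimensional GloVe vectors through gensim's downloader and queries word similarities. It assumes gensim is installed; the model name glove-wiki-gigaword-50 and the query words are just example choices.

    # Requires gensim; the model is downloaded automatically on first use
    import gensim.downloader as api
    
    glove = api.load('glove-wiki-gigaword-50')
    
    # Semantically related words score higher than unrelated ones
    print(glove.similarity('north', 'south'))
    print(glove.similarity('north', 'season'))
    
    # Nearest neighbors of a word in the embedding space
    print(glove.most_similar('king', topn=5))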

Visualizing pre-trained word embeddings can help you understand the relationships and structure of words in the embedding space.

  1. Import libraries.
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.manifold import TSNE
    
    
  2. Download the pre-trained GloVe word vectors (glove.6B.50d.txt from the Stanford NLP GloVe project) and load them into a dictionary.
    
    # Load pre-trained GloVe embeddings into a {word: vector} dictionary
    embeddings_index = {}
    try:
        with open('glove.6B.50d.txt', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
    except OSError as e:
        raise RuntimeError('Could not read the GloVe embeddings file.') from e
    
    # Take the first 200 words and stack their vectors into a matrix
    words = list(embeddings_index.keys())[:200]
    word_vectors = np.array([embeddings_index[word] for word in words])
    
    # Reduce the 50-dimensional vectors to 2D with t-SNE
    # (older scikit-learn versions use n_iter instead of max_iter)
    tsne = TSNE(n_components=2, random_state=0, perplexity=30, max_iter=300)
    word_vectors_2d = tsne.fit_transform(word_vectors)
    
    
  3. Visualize the word embeddings.
    
    # Convert numpy array to dataframe
    df = pd.DataFrame({
        'x': word_vectors_2d[:, 0],
        'y': word_vectors_2d[:, 1],
        'word': words
    })
    
    # Set Seaborn style
    sns.set(style='whitegrid')
    
    # Create the plot
    plt.figure(figsize=(5, 1.9), dpi=1000)
    plt.grid(True)
    ax = sns.scatterplot(data=df, x='x', y='y', hue='word', palette='tab20', legend=False, s=25, alpha=0.7)
    
    # Customize plot appearance
    ax.set(xlabel=None)
    ax.set(ylabel=None)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_xlim(-15, 15)  # Adjust these values as needed
    ax.set_ylim(-15, 15)
    
    # Hide the x and y ticks by setting them to empty lists
    plt.xticks([])
    plt.yticks([])
    
    # Save the figure and show the plot
    plt.savefig('Word Embeddings 2D.png', bbox_inches='tight', transparent=True, dpi=1000)
    plt.show()
    
    

Conclusion

  1. Clustering of Words: Words that are semantically similar (according to their embeddings) tend to appear closer together in the plot. For instance, words like "king" and "queen" might form one cluster, while words like "dog" and "cat" would form their own cluster in a different region.
  2. Dimensionality Reduction: The word embeddings, which are typically high-dimensional vectors (50 dimensions in the GloVe file used here, up to 300 for larger GloVe or Word2Vec models), have been reduced to 2D. While the original high-dimensional relationships are complex, t-SNE tries to preserve the local structure in the reduced 2D space, which is why groups of words that were close in high dimensions tend to appear near each other in this plot.
  3. Interpretation: Each point on the plot represents a word, and the distance between points reflects the similarity between the words in their original high-dimensional vector space. Clusters of words likely share semantic similarities, while isolated points might represent words that are more unique or less contextually related to the others in the dataset (a small nearest-neighbor check on the loaded vectors is sketched after this list).
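
To sanity-check point 3, here is a small follow-up sketch that reuses the embeddings_index dictionary loaded in step 2: it ranks words by cosine similarity to a query word in the original 50-dimensional space, so you can compare the true nearest neighbors with whatever clusters together in the 2D plot. The query word "king" is just an example.

    def nearest_neighbors(query, embeddings_index, topn=5):
        # Rank all other words by cosine similarity to the query word
        q = embeddings_index[query]
        q = q / np.linalg.norm(q)
        scores = {}
        for word, vec in embeddings_index.items():
            if word == query:
                continue
            scores[word] = float(np.dot(q, vec / np.linalg.norm(vec)))
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]
    
    # Compare these neighbors with the clusters visible in the 2D t-SNE plot
    print(nearest_neighbors('king', embeddings_index))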
