In the rapidly evolving field of Natural Language Processing (NLP), word embeddings are essential for converting text into numerical representations that algorithms can process. This article delves into three primary word embedding techniques based on count or frequency: One-Hot Encoding, Bag of Words (BoW), and Term Frequency-Inverse Document Frequency (TF-IDF), with practical Python implementations using scikit-learn.
1. One-Hot Encoding
Overview
One-Hot Encoding is a fundamental technique where each word in the vocabulary is represented as a binary vector. In this representation, each word is assigned a unique vector with a single high (1) value and the rest low (0).
Example
For a vocabulary of ["cat", "dog", "mouse"], the one-hot vectors would be:
- "cat": [1, 0, 0]
- "dog": [0, 1, 0]
- "mouse": [0, 0, 1]
Code Example
Here’s how you can implement One-Hot Encoding using scikit-learn:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Sample vocabulary
vocab = np.array(["cat", "dog", "mouse"]).reshape(-1, 1)
# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # `sparse_output` replaced the older `sparse` parameter in scikit-learn 1.2
# Fit and transform the vocabulary
onehot_encoded = encoder.fit_transform(vocab)
# Print the one-hot encoded vectors
print(onehot_encoded)
Output
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
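A small follow-up sketch (assuming scikit-learn 1.2 or newer, where the dense-output parameter is named sparse_output): setting handle_unknown="ignore" lets the fitted encoder map words outside the vocabulary to an all-zero vector instead of raising an error. The word "lion" below is just a hypothetical out-of-vocabulary example.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Same vocabulary as above
vocab = np.array(["cat", "dog", "mouse"]).reshape(-1, 1)
# handle_unknown="ignore" maps unseen words to an all-zero vector
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoder.fit(vocab)
# "dog" is in the vocabulary; "lion" is not
print(encoder.transform(np.array([["dog"], ["lion"]])))
# [[0. 1. 0.]
#  [0. 0. 0.]]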
Advantages
- Simple and easy to implement.
- Suitable for small datasets.
Disadvantages
- High dimensionality for large vocabularies.
- Does not capture semantic relationships between words.
Use Cases
- Basic text classification tasks.
- Simple NLP applications where semantic context is not crucial.
2. Bag of Words (BoW)
Overview
Bag of Words represents text by the frequency of words, disregarding grammar and word order. This technique constructs a vocabulary of known words and counts their occurrences in the text.
Example
For the sentences "The cat sat on the mat" and "The dog lay on the mat", the vocabulary (in alphabetical order) is [cat, dog, lay, mat, on, sat, the], and the BoW vectors are:
- "The cat sat on the mat": [1, 0, 0, 1, 1, 1, 2]
- "The dog lay on the mat": [0, 1, 1, 1, 1, 0, 2]
Code Example
Here’s how you can implement Bag of Words using scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
# Sample sentences
sentences = ["The cat sat on the mat", "The dog lay on the mat"]
# Initialize CountVectorizer and learn the vocabulary
vectorizer = CountVectorizer()
# Transform the sentences into word-count vectors
X = vectorizer.fit_transform(sentences)
# Print the learned vocabulary and the count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
Output
['cat' 'dog' 'lay' 'mat' 'on' 'sat' 'the']
[[1 0 0 1 1 1 2]
[0 1 1 1 1 0 2]]
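One practical detail worth showing with a short sketch: once the vectorizer has been fitted, transform encodes any new text against the learned vocabulary, and words it has never seen are simply dropped. The new sentence below is a made-up example for illustration.
from sklearn.feature_extraction.text import CountVectorizer
# Fit on the training sentences to learn the vocabulary
sentences = ["The cat sat on the mat", "The dog lay on the mat"]
vectorizer = CountVectorizer()
vectorizer.fit(sentences)
# Encode an unseen sentence with the same vocabulary
new_sentence = ["The cat and the dog sat on the mat"]
print(vectorizer.transform(new_sentence).toarray())
# [[1 1 0 1 1 1 3]]  ("and" is ignored because it is not in the vocabulary)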
Advantages
- Simple and effective for various tasks.
- Suitable for text classification and document similarity.
Disadvantages
- High dimensionality with large vocabularies.
- Loses semantic and contextual information.
Use Cases
- Document classification.
- Spam detection and sentiment analysis.
3. Term Frequency-Inverse Document Frequency (TF-IDF)
Overview
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).
Formula
- Term Frequency (TF): Measures how frequently a word appears in a document.
  Formula: TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
- Inverse Document Frequency (IDF): Measures how important a word is across the whole document collection.
  Formula: IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t
- TF-IDF: The product of TF and IDF.
  Formula: TF-IDF(t, d) = TF(t, d) × IDF(t)
Example
For a document set with "The cat sat on the mat" and "The dog lay on the mat":
- "mat" may have a lower weight if it appears frequently in many documents.
- "cat" and "dog" would have higher weights as they appear less frequently.
Code Example
Here’s how you can implement TF-IDF using scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample sentences
sentences = ["The cat sat on the mat", "The dog lay on the mat"]
# Initialize TfidfVectorizer and compute the TF-IDF weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
# Print the learned vocabulary and the TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
Output
['cat' 'dog' 'lay' 'mat' 'on' 'sat' 'the']
[[0.44554752 0.         0.         0.31701073 0.31701073 0.44554752 0.63402146]
 [0.         0.44554752 0.44554752 0.31701073 0.31701073 0.         0.63402146]]
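Because a common application of TF-IDF is measuring how similar two documents are, here is a short follow-up sketch comparing the two rows above with cosine similarity. The shared words ("the", "on", "mat") pull the score up while the unique words pull it down; the exact value printed is roughly 0.60.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
sentences = ["The cat sat on the mat", "The dog lay on the mat"]
# Build the same TF-IDF matrix as above
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
# Cosine similarity between the two TF-IDF document vectors
print(cosine_similarity(X[0], X[1]))  # prints roughly 0.60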
Advantages
- Highlights important words while reducing the weight of frequently occurring but less informative words.
- Effective in filtering out common words and emphasizing unique terms.
Disadvantages
- Still results in high-dimensional sparse matrices.
- Does not capture semantic relationships between words.
Use Cases
- Information retrieval and search engines.
- Document clustering and topic modeling.
Conclusion
Count or frequency-based word embedding techniques like One-Hot Encoding, Bag of Words, and TF-IDF are foundational methods in NLP. While they are straightforward to implement and useful for various text processing tasks, they have limitations in capturing semantic relationships and handling large vocabularies. As the field advances, more sophisticated embedding techniques are emerging, offering richer and more nuanced representations of textual data.
I hope this article provides a clear and professional overview of count or frequency-based word embedding techniques with practical implementations. Happy learning! 🚀
Here is the notebook containing the examples: https://github.com/Debapriya-source/NLP-Codes/blob/main/word_embedding.ipynb