Natural Language Processing #1: Traditional Embeddings

#datascience #nlp #machinelearning #statistics

Hello there! You are about to embark on an exciting journey through Natural Language Processing, covering its nuances from a programmatic and mathematical standpoint.

Natural Language Processing has been an active field for decades. It is no secret that significant effort went into building chatbots during the 1980s and 90s: systems that communicate with a human and give out a pre-scripted response based on the question asked.
This type of system is usually modeled as a Finite State Machine (FSM) or Deterministic Finite Automaton (DFA).
The major drawback of such a system was its rule-based implementation: a hierarchy of if-else conditionals that becomes complex to read and update.

Much of NLP rests on deriving embeddings (numeric representations) from text data and, in the process, capturing the semantic and syntactic patterns in the data, in order to carry out tasks like:

Spelling Checker
Sentence Autocomplete
Document Summarization
Question Answering
Named Entity Recognition
Machine Translation

In this article, we will look into some of the most widely used frequency-based embedding techniques and delve into their pros and cons.

There are two families of methodologies to derive a word embedding:
1. Frequency-based methods
2. Prediction-based methods

Frequency-based Methods

In this paradigm, a sentence is tokenized into words, and a weight is then computed for each word from how often it occurs, which gives a rough idea of its usage.

Following are the schemes for frequency-based methods:

Count Vector (Bag of Words Model)
TF-IDF Vector (Term Frequency - Inverse Document Frequency)
Co-Occurrence Vector

1. Count Vectors

This method, popularly referred to as the Bag of Words model, is the simplest way to represent text as numeric data.

The process is as follows:

A vocabulary of unique words is built from the corpus.
Each word in the vocabulary is assigned a unique index.
Each word in a sentence is given a count (weight): the number of times it occurs in that sentence.
The vector length of each sentence equals the vocabulary size. Words that do not occur in the sentence get a weight of 0.
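
The steps above can be sketched in plain Python. This is only a minimal illustration, not scikit-learn's implementation: the toy corpus and the names `word_to_index` and `count_vector` are made up for the example, and tokenization is assumed to be a simple whitespace split.

```python
from collections import Counter

# Toy corpus (hypothetical sentences, chosen only for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Step 1: build the vocabulary of unique words across the corpus.
tokenized = [sentence.split() for sentence in corpus]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 2: assign each vocabulary word a unique index.
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def count_vector(tokens):
    """Steps 3 and 4: a vector as long as the vocabulary, holding each
    word's count in the sentence; absent words keep a weight of 0."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokens).items():
        vector[word_to_index[word]] = count
    return vector


vectors = [count_vector(tokens) for tokens in tokenized]

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 1, 0, 1, 1, 2]
```

Note that even in this tiny example several slots are 0; with a realistic vocabulary almost all of them would be.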

A BoW (Bag of Words) model can be built using scikit-learn's CountVectorizer class.

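This method is generally not recommended on its own, since it ignores word order and therefore fails to capture the semantic and syntactic structure of the sentence.

Additionally, the method results in a high-dimensional sparse matrix (mostly zeros, with one column per vocabulary word), which becomes costly to store and compute with as the vocabulary grows.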

2. TF-IDF Vectors

TF-IDF (Term Frequency - Inverse Document Frequency) is a weighting scheme that combines two quantities.

Term Frequency: a measure of how often the word 't' occurs in the document 'd'.

Inverse Document Frequency: a measure of how informative a term is, based on how rare or frequent it is across all the documents/sentences. The rarer the term, the higher its IDF.
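
The two quantities can be sketched from scratch as follows. The formulas used here are one common textbook formulation and are an assumption of this example: TF is the term's count divided by the document length, and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries may use different variants. The toy documents and function names are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical, chosen only for illustration).
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird sang".split(),
]


def term_frequency(term, document):
    # Share of the document's tokens that are `term`.
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term, documents):
    # log(N / df): N documents in total, df of them contain `term`.
    # Assumes the term occurs in at least one document (df > 0).
    df = sum(1 for document in documents if term in document)
    return math.log(len(documents) / df)


def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


# "the" occurs in every document, so its IDF (and TF-IDF) is 0.
print(tf_idf("the", documents[0], documents))
# "cat" occurs in only one document, so it gets a positive weight.
print(tf_idf("cat", documents[0], documents))
```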

TF-IDF vectors can be built using scikit-learn's TfidfVectorizer.

TF-IDF gives larger values to words that are rare across the corpus. A word's score is highest when both its TF and IDF are high, i.e. the word is rare across all the documents combined but frequent in a single document.

3. Co-Occurrence Vectors

The big idea – Similar words tend to occur together and will have a similar context.

There are mainly two concepts to understand for building a co-occurrence matrix:

Co-occurrence
Context Window

Co-occurrence – For a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a context window.

Context Window – A context window is specified by a size (a number of words) and a direction (left, right, or both sides of the word).
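
As a rough sketch of these two concepts, the function below counts co-occurrences with a symmetric window, i.e. it looks `window_size` words to both the left and the right. The function name, the whitespace tokenization, and the toy sentences are assumptions made for illustration.

```python
from collections import defaultdict


def co_occurrence_counts(sentences, window_size=2):
    """Count, for each ordered pair (word, neighbour), how often the
    neighbour appears at most `window_size` positions to the left or
    right of the word within the same sentence."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbour
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (hypothetical sentences, chosen only for illustration).
sentences = ["he is not lazy", "he is intelligent", "he is smart"]
counts = co_occurrence_counts(sentences, window_size=2)

print(counts[("he", "is")])             # 3: adjacent in all three sentences
print(counts[("is", "lazy")])           # 1: two positions apart in the first sentence
print(counts.get(("he", "lazy"), 0))    # 0: three positions apart, outside the window
```

Because the window is symmetric, the counts are symmetric too: `counts[("is", "he")]` equals `counts[("he", "is")]`.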

A co-occurrence matrix preserves the semantic relationship between words to some extent. Furthermore, it can be factorized using a Truncated SVD transformation to obtain dense vector representations.
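
A minimal sketch of that factorization, assuming NumPy and scikit-learn are installed. The six-word vocabulary and the hand-written co-occurrence counts are hypothetical, and `n_components=2` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

vocabulary = ["he", "is", "not", "lazy", "intelligent", "smart"]

# Hypothetical symmetric co-occurrence counts; rows and columns
# follow the order of `vocabulary`.
co_matrix = np.array([
    [0, 3, 1, 0, 1, 1],
    [3, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
], dtype=float)

# Reduce each 6-dimensional row to a dense 2-dimensional vector.
svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(co_matrix)

word_vectors = dict(zip(vocabulary, embeddings))

for word, vector in word_vectors.items():
    print(word, np.round(vector, 3))
```

In this toy matrix, "intelligent" and "smart" have identical co-occurrence rows, so they end up with the same dense vector, which is the sense in which words with similar contexts get similar embeddings.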

In conclusion, we covered three base methods for frequency-based word embeddings: BoW, TF-IDF, and the co-occurrence matrix.