Introduction
Large Language Models (LLMs) are the technology that fuels ChatGPT, Bard and other well-known chatbots. But have you ever wondered how these models actually process all the textual information we provide to them? Any machine learning, deep learning or statistical learning system understands only numbers, so how do these LLMs, or any language model for that matter, understand sentences, code and even different languages? In this post we shall look into exactly that.
Understanding Text
Consider the sample text `Quick Brown Fox Jumps Over The Wall`. This is a sentence, and it can be split into the following tokens: `Quick`, `Brown`, `Fox`, `Jumps`, `Over`, `The` and `Wall`. Tokens are simply the words that make up a sentence.
For any language processing task, we can be sure that we will be working with multiple text inputs: these can be sentences, paragraphs, whole files, etc. Each of these individual units of textual data is called a document.
An exhaustive collection of all the unique tokens from all documents in a dataset is called the vocabulary.
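As a quick illustration, here is a minimal sketch of tokenization and vocabulary building. The second document is just a made-up example, and a plain whitespace split is a crude stand-in for a real tokenizer.

```python
# A minimal sketch: lower-case, split on whitespace, collect unique tokens.
documents = [
    "Quick Brown Fox Jumps Over The Wall",
    "The Fox Is Quick",  # hypothetical second document for illustration
]

tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({token for doc in tokenized for token in doc})

print(tokenized[0])   # ['quick', 'brown', 'fox', 'jumps', 'over', 'the', 'wall']
print(vocabulary)     # ['brown', 'fox', 'is', 'jumps', 'over', 'quick', 'the', 'wall']
```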
Vectors for Text -- How is that even possible?
I know that for a lot of us, vectors look something like ai + bj + ck. This is a valid 3-D representation of a vector, but when we convert words to vectors we need to go beyond 3-D. How does that happen?
From a 3-D perspective, a vector is defined across three fundamental axes, X, Y and Z, which represent orientation in 3-D space. Now I want you to consider each token of your sentence as a fundamental axis of the final representation vector. For the example sentence `Quick Brown Fox Jumps Over The Wall`, its vector would live in a space whose fundamental axes are `Quick`, `Brown`, `Fox` and so on. However, there is still a lot more that goes into representing words as vectors.
The Bag of Words Model
A bag of words model represents a piece of textual data, say a sentence or a document, as the collection of words/tokens it contains. The model does not maintain the grammatical structure of the text or even the order in which the tokens appear.
For example, the following text, `The movie was superb.`, can be broken down into the following tokens: `The`, `movie`, `was` and `superb`. Now the text can be represented as one instance of `The`, one instance of `movie`, one instance of `was` and one instance of `superb`. This is the bag of words model, and we start our vector representation of text data by building upon this intuition.
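Here is a minimal sketch of that intuition in Python, using `collections.Counter` to count token occurrences while ignoring grammar and word order.

```python
from collections import Counter

# Count each token in the text, ignoring grammar and word order
text = "The movie was superb."
tokens = text.lower().replace(".", "").split()

bag_of_words = Counter(tokens)
print(bag_of_words)
# Counter({'the': 1, 'movie': 1, 'was': 1, 'superb': 1})
```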
One Hot Encoding
One Hot Encoding is the simplest way of representing text as vectors. Consider the following set of documents,
The movie was superb.
I liked the movie.
The movie was okayish.
Firstly let's analyze the vocabulary of these documents. The vocabulary would consist of `the`, `movie`, `I`, `was`, `superb`, `liked` and `okayish`. The size of the vocabulary is 7 since there are 7 unique words. Now that we have this information, we can start constructing our One Hot Encoded vectors for each of these sentences.
These vectors are formed as follows:
- All the tokens in your vocabulary are considered while forming the vector, so the size of the vector is equal to the size of the vocabulary.
- For a given document in the dataset, its vector has the value 1 for every vocabulary token present in the document and 0 for all others.

e.g. `I liked the movie` -> 1.I + 1.liked + 1.the + 1.movie + 0.was + 0.superb + 0.okayish
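Below is a small sketch of this encoding for the three documents above. Note that the vocabulary here ends up sorted alphabetically, so the component order differs from the prose, but the idea is the same.

```python
documents = [
    "The movie was superb.",
    "I liked the movie.",
    "The movie was okayish.",
]

def tokenize(text):
    # Lower-case, strip the period and split on whitespace
    return text.lower().replace(".", "").split()

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for doc in tokenized for token in doc})
# ['i', 'liked', 'movie', 'okayish', 'superb', 'the', 'was']

def one_hot_encode(tokens, vocabulary):
    # 1 if the vocabulary token appears in the document, 0 otherwise
    return [1 if word in tokens else 0 for word in vocabulary]

for doc, tokens in zip(documents, tokenized):
    print(doc, "->", one_hot_encode(tokens, vocabulary))
# e.g. "I liked the movie." -> [1, 1, 1, 0, 0, 1, 0]
```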
Problems with One Hot Encoding
The vectors formed as a result of One Hot Encoding are large and sparse. Say you have a vocabulary containing 1000 words but each of your input documents has at most 20 tokens; each document will still be represented by a 1000-dimensional vector, with at least 980 zeroes and at most 20 ones. This makes the whole thing sparse and takes up a lot of space.
All words are given equal weightage. Consider the example `I liked the movie`: with One Hot Encoding, every token in the sentence gets a weight of 1. However, the word that actually conveys the positive feeling in this sentence is `liked`, which should ideally have more weightage than the other words.
One Hot Encoding also does not preserve any relationship between a token and the other tokens in a document. This means the OHE output does not preserve the order or pattern in which words appear in a sentence.
Managing Dimensionality with Preprocessing
With one hot encoded vectors, we see that the vector size is really large. However, we can do some preprocessing on the text data to reduce the size of these vectors.
Stems and Lemmas
Individual words/tokens hold great meaning in the context of textual data. Consider the following example,
A: It was an amazingly boring movie.
B: It was an amazingly interesting movie.
Both sentences are similar in their makeup, but a single word (boring/interesting) changes the entire meaning.
Stemming is a heuristic approach that removes prefixes and suffixes to approximate the base form of a word. For example, consider the words `Amaze`, `Amazed`, `Amazing` and `Amazingly`. When we stem them, we strip the suffixes and keep the base that is common to all of them, which in this case is `Amaz`.
Lemmatization is a process that reduces words to their base or dictionary form, known as the lemma. Unlike stemming, lemmatization considers the context and part of speech (POS) to generate accurate lemmas. Consider the words `Historical`, `Historically` and `Historian`. The lemma of these words would be `History`.
From the above, we can see that stemming a word may not always return a word that holds meaning, e.g. `Amaze` stems to `Amaz`, which is not an actual English word. This is why we introduce the concept of lemmas. A lemma is the base word from which other forms originate, e.g. `Historical`, `Historian` -> `History`. Here `History` is the lemma from which all of its forms originate.
Both stemming and lemmatization are important steps while converting a given text into its vector representation (Don't worry we will talk about vector representations real soon).
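For a rough idea of how this looks in code, here is a sketch using NLTK's Porter stemmer and WordNet lemmatizer. It assumes NLTK is installed and its WordNet data has been downloaded; exact outputs depend on the stemmer/lemmatizer you choose.

```python
# Assumes `pip install nltk` and nltk.download("wordnet") have been run.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["amaze", "amazed", "amazing"]:
    print(word, "->", stemmer.stem(word))   # all three reduce to "amaz"

# The WordNet lemmatizer uses the part of speech ("v" = verb, "n" = noun)
# to map words to a valid dictionary form.
print(lemmatizer.lemmatize("liked", pos="v"))   # like
print(lemmatizer.lemmatize("movies", pos="n"))  # movie
```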
Stopwords
When we analyze sentences or use them in a machine learning task, we often want to optimize the input by keeping only the tokens (words) that carry some computational significance. Take this example: `I like pizza`. If I write this as `like pizza`, you still know that both sentences are talking about someone who likes pizza. Removing `I` doesn't really change the meaning in a computational sense; of course it changes the grammatical sense, but not the computational one.
These kinds of tokens/words are known as stopwords. When designing an NLP system, we preprocess our input data to remove these stopwords and also stem/lemmatize our tokens to reduce them to a base form. Once these preprocessing steps are performed, we can start talking about vectorizing our input texts.
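As a small sketch of stopword removal, NLTK ships a list of English stopwords that can be used to filter tokens (this assumes the stopwords corpus has been downloaded via `nltk.download("stopwords")`).

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    tokens = text.lower().replace(".", "").split()
    return [t for t in tokens if t not in stop_words]

print(remove_stopwords("I like pizza"))  # ['like', 'pizza']
```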
Term Frequency and Inverse Document Frequency (TF-IDF)
TF-IDF is one of the most common vectorizer models in use today. One of the key differentiators of TF-IDF from the One Hot Encoding model is that TF-IDF is able to weigh tokens based on their relevance in the entire set of documents. But what is TF-IDF?
Let's break it down,
Term Frequency: The Term Frequency (TF) of a token is the number of times the token appears in a document, divided by the total number of tokens in that document. Note that all of these tokens are obtained after preprocessing the document.
Inverse Document Frequency: The Inverse Document Frequency (IDF) of a token is the logarithm of the total number of documents in the set divided by the number of documents that contain the token.
TF-IDF: TF-IDF is simply the product of the TF and IDF values for a particular token.
Now, let's see how we can actually construct TF-IDF vectors for a given set of documents. Consider the following corpus of documents,
I like all kinds of fruits.
I like all kinds of fruits but, I like pineapple more.
I do not like any kind of fruit.
This is the initial corpus of documents that we have. Let's apply some preprocessing to it: we remove stopwords, lower-case all words and stem the plural tokens into singular ones. The preprocessed set will look like this,
like kind fruit
like kind fruit like pineapple
not like kind fruit
Now if we extract the vocabulary from the text, we would get the following words: `like`, `kind`, `fruit`, `pineapple` and `not`.
Let's compute the TF values for all of these tokens in each document (TF = count of the token in the document / total tokens in the document):
- `like kind fruit`: like = 1/3, kind = 1/3, fruit = 1/3, pineapple = 0, not = 0
- `like kind fruit like pineapple`: like = 2/5, kind = 1/5, fruit = 1/5, pineapple = 1/5, not = 0
- `not like kind fruit`: like = 1/4, kind = 1/4, fruit = 1/4, pineapple = 0, not = 1/4

The IDF value for each of these tokens (IDF = log(total documents / documents containing the token)) is:
- like, kind, fruit: log(3/3) = 0
- pineapple: log(3/1) = log 3
- not: log(3/1) = log 3

The final TF-IDF vectors, obtained by multiplying the TF and IDF values (with components ordered as like, kind, fruit, pineapple, not), are:
- `like kind fruit`: [0, 0, 0, 0, 0]
- `like kind fruit like pineapple`: [0, 0, 0, (1/5) · log 3, 0]
- `not like kind fruit`: [0, 0, 0, 0, (1/4) · log 3]
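To make the arithmetic concrete, here is a minimal sketch that computes these TF-IDF vectors by hand using the natural logarithm. Note that libraries such as scikit-learn use a smoothed IDF formula, so their exact numbers will differ.

```python
import math

# TF  = (count of token in document) / (total tokens in document)
# IDF = log(total documents / documents containing the token)
docs = [
    ["like", "kind", "fruit"],
    ["like", "kind", "fruit", "like", "pineapple"],
    ["not", "like", "kind", "fruit"],
]
vocab = ["like", "kind", "fruit", "pineapple", "not"]

def tf(token, doc):
    return doc.count(token) / len(doc)

def idf(token, docs):
    containing = sum(1 for doc in docs if token in doc)
    return math.log(len(docs) / containing)

for doc in docs:
    vector = [round(tf(token, doc) * idf(token, docs), 3) for token in vocab]
    print(vector)
# [0.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.22, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.275]
```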
Why does this work?
The job of TF is to determine the importance of a token within a given document. In the above case, only the second document, `I like all kinds of fruits but, I like pineapple more.`, has a token that is repeated twice, namely `like`. If a token is repeated within a document, it might carry more weight than the other tokens of that document, and this is what TF measures.
IDF calculates how important a token is in the entire set of documents. If we look at the IDF values above, we see that the tokens `like`, `kind` and `fruit` have an IDF value of 0. This is because they are present in every document, so they do not provide any distinctive information about one document compared to the others. On the other hand, `pineapple` and `not` have the highest IDF values, because these two tokens differentiate their documents from the others.
When we combine these values, we get our final TF-IDF vectors, which capture the effect of a token within a given document together with its effect across the entire set.
Pros and Cons of TF-IDF
From what we have seen above, TF-IDF is really good at calculating the individual contribution of tokens to a document's uniqueness. Thus it can help in determining the similarity or dissimilarity between two different documents. Hence, TF-IDF is still widely used for the task of text classification.
Since TF-IDF is able to weigh individual tokens within a document, it can also help us reduce the dimensionality of the final vector: we can drop all tokens whose TF-IDF value does not exceed a certain threshold, giving us denser vectors than One Hot Encoding.
The con of TF-IDF is that it cannot capture any semantic relationship between words; it only focuses on tokens and their relevance, not on the order or pattern in which they appear.
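As a quick illustration of the document-similarity use case mentioned above, here is a sketch using scikit-learn's TfidfVectorizer together with cosine similarity. Keep in mind that scikit-learn applies IDF smoothing and L2 normalization by default, so its values differ from the hand-computed ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I like all kinds of fruits",
    "I like all kinds of fruits but I like pineapple more",
    "I do not like any kind of fruit",
]

# Fit TF-IDF vectors for the corpus and compare the documents pairwise
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(cosine_similarity(tfidf_matrix))  # 3x3 matrix of pairwise similarities
```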
Context Injection in the form of N-Grams
When we were talking about TF-IDF, one of the issues that came up is that it focuses on individual tokens and skips the context in which a token is being used. There is a neat trick to inject context into the TF-IDF model.
Instead of considering only individual tokens to build our TF-IDF vector space, we start concatenating adjacent tokens to create additional features. For example, the tokens `good` and `movie` can be combined to create a new feature `good movie`, so now we have some context along with our tokens. We can concatenate as many tokens as we like: combining two tokens gives a bi-gram, combining three gives a tri-gram, and for N tokens we have an N-gram model.
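As a sketch of how this looks in practice, recent versions of scikit-learn let you add N-grams through the ngram_range parameter; the toy corpus below is only an illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["good movie", "not good movie", "bad movie"]

# ngram_range=(1, 2) keeps individual tokens and every pair of adjacent tokens
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(corpus)

print(vectorizer.get_feature_names_out())
# ['bad' 'bad movie' 'good' 'good movie' 'movie' 'not' 'not good']
```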
Word2Vec
Consider this problem: you have a context, e.g. `I really _____ the movie`, and you want to predict the token in the blank. In this case, let's say the token is `liked`. How do we go about doing that? Word2Vec is one of the answers to that problem.
Let's start by considering our vocabulary, which in this case consists of the tokens `I`, `really`, `liked`, `the` and `movie`. For each of these, we can create a one hot encoded vector over the entire vocabulary. These one hot encoded vectors act as the input to the neural network that we will use to predict a token given a context. In our case, we want to predict the word `liked` given the rest of the words.
To predict the token `liked` in this case, we create four inputs: two for the tokens before `liked` and two for the tokens after it.
Let's understand this architecture,
- We provide our 4 context tokens as one hot encoded inputs.
- These are then provided to an embedding layer that contains N neurons in it.
- The size of the embedding layer determines the dimensionality of the embedding vector.
- The embedding layer combines the one hot encoded inputs into a single dense vector representation.
- This embedded representation is then passed forward to a fully connected network, which predicts the actual word.
This model of predicting a word from its context is called the Continuous Bag of Words (CBOW) model. The inverse problem, i.e. predicting the context from a given word, is called Skip-gram.
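To make the setup more concrete, here is a small illustrative sketch (with an assumed window size of 2) of how (context, target) training pairs could be generated from our example sentence; Skip-gram would simply flip each pair around.

```python
# Generate (context, target) pairs for CBOW with a window of 2 tokens on each side
tokens = ["i", "really", "liked", "the", "movie"]
window = 2

pairs = []
for i, target in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# e.g. ['i', 'really', 'the', 'movie'] -> liked
```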
How to leverage Word2Vec to create Word Embeddings?
You can sample a very large corpus of text, like the Wikipedia database. You can preprocess this corpus to create a large vocabulary of tokens and break the dataset down into documents.
Once you have these documents, you can train a CBOW or Skip-gram model on them for the token or context prediction task. Once the model converges, you can isolate the embedding layer from it and use it to embed any token or sentence.
Of course this is a bird's eye view of how you can go on and create word embeddings for yourself. Further reading on the topic is advised to understand the actual implementation.
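As a rough sketch of what that can look like in practice, the gensim library exposes Word2Vec directly (the parameter names below follow the gensim 4.x API and differ in older versions). The toy corpus is only an illustration; a real model would be trained on something like the Wikipedia corpus mentioned above.

```python
from gensim.models import Word2Vec

# Each "sentence" is a pre-tokenized, preprocessed document;
# a real model would be trained on a far larger corpus.
sentences = [
    ["i", "really", "liked", "the", "movie"],
    ["the", "movie", "was", "superb"],
    ["the", "movie", "was", "okayish"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the embeddings
    window=2,        # context window size
    min_count=1,     # keep every token, even rare ones
    sg=0,            # 0 = CBOW, 1 = Skip-gram
)

vector = model.wv["movie"]                     # embedding for a single token
print(model.wv.most_similar("movie", topn=3))  # nearest tokens in the embedding space
```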
Note: Word2Vec converts individual words to vectors. To convert entire documents/sentences to vectors you can refer to Sentence2Vec which functions similarly.
Pros and Cons of Word Embeddings
Word embeddings created using models like Word2Vec are very dense compared to the TF-IDF and OHE approaches; here, the size of the embedding is determined by the size of the embedding layer. When embeddings are created in this manner, they also take into account the context that surrounds a token, not just the token itself.
Word Embeddings have lots of use cases. They can be used for document distance finding, cosine similarity, text generation, text summarisation and creating LLMs!
One of the issues with these embeddings is that they may not generalize well for all datasets and tasks. They are more performant on the kind of data they have been trained on. Example, if an embedding model is trained on a corpus of Historical text it may not work well on tasks associated with a corpus of Medical text.
Final Thoughts
The world is going crazy over LLMs, but really, if there are no embeddings, there are no LLMs. We've discussed the very basics of embeddings, starting from one hot encodings all the way up to more practical forms like TF-IDF and Word2Vec, but there is a whole world of them out there: GloVe, BERT, and more. There is no single embedding that works best everywhere, so when you use embeddings for your next NLP task, test out a few of these and see which works best for you.
The same ideas used to embed text can be applied to embed other things like code, audio, images and even different languages, enabling a whole slew of applications. So keep learning and exploring. Cheers!