Text Analytics: Wrangling Text

#machinelearning #textanalytics #ai

Hey people,

Welcome back to yet another exciting series of narratives in our quest to understand the fundamentals of Text Analytics. In the last post we saw two things in particular:

A few use cases
A typical Text Analytics pipeline

Thanks to those examples, we should now have a clearer understanding of what we are actually talking about. Things should seem less like magic from here on, because we are going to look into some specific aspects of the pipeline.

In this post, we'll start demystifying key concepts in the pipeline, beginning with Text Wrangling. But before that, let me pose a question to you: why is order required in chaos?

Well, every chaos has an underlying order associated with it. To simplify things in disorder, we start by finding some kind of order in it. Take, for example, what happens when your mom starts cleaning your messy room, which I assume must be in the most chaotic state, at least for some of you. What does she do first? She tries to find some order in it, segregating the items into groups that belong together. Once this is done, it is merely a matter of placing those groups of things in the right place. Think about it: how much time would it have taken if she were to place each item one by one as she reached for it? It would surely have taken forever, depending on how messy your room is.

The bottom line is, when you first find order in chaos, you reduce the time required for every subsequent step of the task. You can apply this to any scenario. When you have a huge task at hand, it is always prudent to invest some time, and I suggest a significant amount of time, in finding some patterns, some order, some transformation, so that the subsequent steps become easier.

I'd be more than happy to hear some better examples of #findingOrderinChaos from all of you in the comments section.

Text Wrangling

Generally speaking, text wrangling is the process of cleaning the data, finding (or even deriving) some inherent structure in it, and enriching the raw data into the desired format, i.e., transforming it so that better decisions can be made in less time.

The need for data wrangling is often a by-product of poorly collected or poorly presented data. If one had a prescient vision of the use case, it would not even be required. Bringing the data into the context of the use case is therefore a very important step in the entire process. In fact, in a significant number of cases, this is what determines how well your model works.

This is a fairly comprehensive, intuitive, and self-explanatory version of the standard definition that goes around.

Here is what Wikipedia has to say about Data Wrangling. I believe this is all it takes to understand the crux of the concept. The article is worth a read (at least the "Core Ideas" section).

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.


Let's outline some of the common steps involved in the process of text wrangling.

Text Cleaning
Specific Preprocessing
Tokenization
Stemming or Lemmatization
Stop Word removal

Let's look at each one of them in sequence.

Text Cleaning

Once you have the raw data with you, the first step, intuitively, is to make sense of it. Text cleaning is a broad term for the many common cleaning operations performed on the text. For instance, consider an HTML file as your input. A typical file consists of a lot of markup tags, styling, some metadata, and also the text you want to parse. Getting rid of the redundant data means getting rid of everything in the file except the string of text we are concerned with. Many languages have parsers for this task, which makes it relatively simple with a modern-day toolkit.

In summary, any process that aims to make the text cleaner and to remove the noise surrounding it can be termed text cleaning.
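
To make the HTML case concrete, here is a minimal sketch using only Python's standard library. The class name TextExtractor, the helper clean_html, and the sample markup are my own illustrative choices, and joining the text chunks with spaces is a simplification that a real page may need to refine.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def clean_html(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


# Illustrative sample input, not taken from any real page.
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Demo</h1><p>Text wrangling is fun.</p></body></html>"
)
print(clean_html(sample))
```

The tags and the styling rule are dropped, and only the heading and paragraph text survive.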

Specific Preprocessing

This is again a very broad term, covering all the kinds of operations you would like to do before getting the ball rolling. It could be something like sentence splitting, where long text is broken down into sentences, depending on the application. It could also be handling punctuation or cutting out redundant strings. Spelling correction and the removal of specific nouns could also be done at this stage.
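
Two of these operations, sentence splitting and punctuation handling, can be sketched as follows. This is a deliberately naive version under my own assumptions: the function names are hypothetical, the splitting rule only looks for '.', '!' or '?' followed by whitespace, and only ASCII punctuation is removed.

```python
import re
import string


def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "Dr." would be split incorrectly.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def strip_punctuation(sentence):
    # Remove ASCII punctuation, lowercase, and normalise whitespace.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.translate(table).lower().split())


text = "Wrangling comes first. Is the data clean? Not yet!"
sentences = split_sentences(text)
print(sentences)
print([strip_punctuation(s) for s in sentences])
```

Which of these steps you actually need, and in what order, depends on the application.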

Tokenization

A token is the smallest unit of text that the program or machine processes. Tokenization is essential for getting a vector representation of the text, and for the computer program to process it any further. Tokenization, therefore, is the process of breaking a sentence down into simple words. This can be done using various techniques, depending on the problem statement and the algorithm to be used thereafter. It sometimes isn't as simple as it seems, though, and there are many configurable options in different languages to deal with this.
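
Here is a small sketch of why tokenization is not always as simple as splitting on spaces. Both functions are illustrative; the regular expression is just one possible rule (lowercase letters and digits, with an optional internal apostrophe), not a standard.

```python
import re


def whitespace_tokenize(sentence):
    # Simplest possible approach: punctuation stays glued to the words.
    return sentence.split()


def regex_tokenize(sentence):
    # Keep runs of letters/digits, allowing one internal apostrophe
    # so that a contraction like "don't" stays a single token.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", sentence.lower())


sentence = "Don't panic: tokenization isn't always simple!"
print(whitespace_tokenize(sentence))
print(regex_tokenize(sentence))
```

The first call leaves tokens such as "panic:" and "simple!", while the second returns clean lowercase words. Which behaviour is right depends on the algorithm you feed the tokens into.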

Stemming or Lemmatization

Stemming is exactly what it sounds like! It is the process of reducing a word, in any of its forms, to its root equivalent. The rule is pretty simple: remove some of the common prefixes and suffixes. While this is helpful on some occasions, it is not on all. For example, consider the word studying: its root word is study. All its variations, like studies, studied, and studying, come under the umbrella of the root word study. This is what stemming does in essence.

Lemmatization does something similar but is more methodical: it converts every inflected form of a word to its root form, the lemma. It makes use of the context and the part of speech to determine the lemma of a word. In effect, it does what stemming does with one extra step, that of checking that the resulting lemma is actually part of the dictionary.

Interestingly, a stem might not be an actual word in the dictionary, but a lemma has to be. Because it skips that check, stemming is comparatively much faster than lemmatization. But in many applications lemmatization might be the one you need.
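
The difference can be sketched with a toy example. Note the assumptions: toy_stem is a crude suffix stripper of my own, not a real stemming algorithm such as Porter's, and toy_lemmatize uses a tiny hand-written lookup table in place of a real dictionary and ignores part of speech entirely.

```python
# Suffixes are tried in this order; this list is illustrative only.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real dictionary.
LEMMAS = {
    "study": "study",
    "studies": "study",
    "studied": "study",
    "studying": "study",
}


def toy_lemmatize(word):
    word = word.lower()
    # Unknown words are returned unchanged.
    return LEMMAS.get(word, word)


words = ["study", "studies", "studied", "studying"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer produces "studi" for studies and studied, which is not a dictionary word, while the lookup always lands on "study". The lookup is also the part that costs extra work in a real lemmatizer.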

Stop Word removal

Consider any sentence, not necessarily one in English. You will find many words that are not required to understand its meaning. You can think of them as supporting words, used mainly to satisfy the grammar of the language. Articles and pronouns are typically classified as stop words; some of the most common ones are the, of, is, and so on. You can see the full list here. And mind you, this is not a very comprehensive list: based on the requirements of the application, you can add more words to it. One could argue that some of these words are still required for the sentence to make sense. No worries, just exclude those words from the list of stop words and that's that. What you are left with is a vector of words that you can now play with, containing only what is essential to the problem your application is trying to solve.
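
A minimal sketch of this filtering step is shown below. The stop word set is a short illustrative sample I made up, not the full list mentioned above, and the keep parameter is a hypothetical way of excluding words you still need from the list.

```python
# A tiny illustrative stop word set; real lists are much longer.
DEFAULT_STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "is", "and", "to", "in", "it", "not"}
)


def remove_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS, keep=()):
    # 'keep' names words that must survive even if they are on the stop list.
    effective = set(stop_words) - set(keep)
    return [token for token in tokens if token.lower() not in effective]


tokens = ["the", "quality", "of", "the", "data", "is", "not", "good"]
print(remove_stop_words(tokens))
print(remove_stop_words(tokens, keep={"not"}))
```

In the first call "not" is dropped and the meaning of the sentence flips. Keeping it in the second call is exactly the kind of application-specific adjustment described above.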

I hope this was helpful and that I was able to put things down in a simple way. Please feel free to reach out to me on Twitter @AashishLChaubey in case you need more clarity or have any suggestions.

In the next article, we will see what text visualization is. Until next time...