<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aashish Chaubey 💥⚡️</title>
    <description>The latest articles on DEV Community by Aashish Chaubey 💥⚡️ (@aashish).</description>
    <link>https://dev.to/aashish</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F384801%2Fa0bbe4dd-287b-4b4c-b210-94bf15a398fa.jpg</url>
      <title>DEV Community: Aashish Chaubey 💥⚡️</title>
      <link>https://dev.to/aashish</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aashish"/>
    <language>en</language>
    <item>
      <title>What is the difference between SQL and SQL Server (and similar tools)? 😇🎉</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Sun, 11 Jul 2021 12:39:30 +0000</pubDate>
      <link>https://dev.to/aashish/what-is-the-difference-between-sql-and-sql-server-and-similar-tools-3n3o</link>
      <guid>https://dev.to/aashish/what-is-the-difference-between-sql-and-sql-server-and-similar-tools-3n3o</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;I am a programmer and have been one for the last three years. I completed my B.Tech degree back in 2018 and, I must admit, I was not a big fan of DBMS; I somehow managed to complete the course and secure passing grades. I know it is an important subject, but it never appealed to me as much, and therefore I never took much interest in it (something I always felt guilty about). &lt;/p&gt;

&lt;p&gt;As established, it is no epiphany for me that this is an important subject and that I must get my hands dirty with it, so I decided to pursue it and complete at least two projects in the next couple of weeks. That way, apart from Big Data (which I am slowly working on too), I will have all my ground covered as a full-stack data scientist. &lt;/p&gt;

&lt;p&gt;That's enough about me; now let's get to the title of this blog post. I know many beginners with SQL have this question in mind. So here I am, explaining the basics as I learn them. Please feel free to point out anything you feel should be corrected; I'll appreciate the feedback.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgu0twslt9hude25vops.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgu0twslt9hude25vops.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's first get to know, in brief, what SQL is!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;📝 Definition: SQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By definition, SQL (Structured Query Language) is a query language. It is used with relational (structured) databases, and also in relational data stream management systems for processing data streams in real time, to query and manipulate relational data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Gotcha&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So basically, SQL is a language. And since it is a language, and quite a popular one, it has various dialects. Anyone familiar with computers, or even remotely acquainted with how a computer works, will know that a language needs some kind of environment to be executed in. &lt;/p&gt;

&lt;p&gt;It is the environment that interprets the commands of the language, identifies the dialect, and executes each command according to that dialect. Let's park this environment concept for now; we will come back to it later in this post. First, let's talk about SQL a little more.&lt;/p&gt;

&lt;p&gt;SQL comprises three major sub-languages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Definition Language (DDL): to create and modify the structure of the database&lt;/li&gt;
&lt;li&gt;Data Manipulation Language (DML): to perform read, insert, update, and delete operations on the data of the database&lt;/li&gt;
&lt;li&gt;Data Control Language (DCL): to control the access of the data stored in the database&lt;/li&gt;
&lt;/ol&gt;
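&lt;p&gt;To make the DDL/DML split concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (SQLite is just an illustrative engine here; the table and column names are made up):&lt;/p&gt;

```python
import sqlite3

# In-memory database; sqlite3 ships with Python, so no server is needed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create and modify the structure of the database.
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")

# DML: read, insert, update, and delete the data itself.
cur.execute("INSERT INTO employees (name) VALUES (?)", ("Ada",))
cur.execute("INSERT INTO employees (name) VALUES (?)", ("Linus",))
rows = cur.execute("SELECT name FROM employees ORDER BY name").fetchall()
print(rows)  # [('Ada',), ('Linus',)]

# DCL (GRANT/REVOKE) is not supported by SQLite itself; access control
# is a feature of server products such as SQL Server or PostgreSQL.
```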




&lt;p&gt;&lt;strong&gt;📝 Definition: SQL Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL Server (Microsoft SQL Server) is a proprietary RDBMS that executes SQL statements. It also provides additional features and functionality so that users can interact with the database properly and perform all database operations efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Gotcha&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So basically, it is database software. It uses SQL as the language to query the database.&lt;/p&gt;

&lt;p&gt;The popular ones include MySQL, SQL Server, Oracle, Informix, PostgreSQL, etc. They are a mix of open-source and proprietary software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmvo4zbpjfyv76adzuui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmvo4zbpjfyv76adzuui.png" alt="Source: SPLessons"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I hope that clarifies your doubt and you are good to go ahead and decide which to use where. It should also make clear what you are using and which products you can compare. (I know I haven't been comprehensive with the list, but which software we use really depends on requirements concerning security, control features, compatibility with the server, and the source of distribution.) &lt;/p&gt;

&lt;p&gt;People who know more about it, please share your opinion through the comments.&lt;/p&gt;

&lt;p&gt;Thanks - until next time!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>mustknow</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>Boy or Girl Paradox... What the heck is it??? 🤯🤷</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Tue, 29 Jun 2021 17:25:52 +0000</pubDate>
      <link>https://dev.to/aashish/boy-or-girl-paradox-what-the-heck-is-it-1m0j</link>
      <guid>https://dev.to/aashish/boy-or-girl-paradox-what-the-heck-is-it-1m0j</guid>
      <description>&lt;p&gt;Hi friends,&lt;/p&gt;

&lt;p&gt;I am a machine learning and data science enthusiast. I love playing with numbers and finding insights, and I know many of you are the same, at least to some degree. Even if you are not, as long as you have some interest in mathematics, there is a great chance you will find this post interesting. &lt;/p&gt;

&lt;p&gt;Please let me know if you already knew this, and if there are more like it that I should check out, please tell me in the comments below. I encourage all my fellow readers to read them too.&lt;/p&gt;




&lt;p&gt;I like solving problems; sometimes I find them on renowned platforms like Kaggle and HackerEarth. I randomly came across this seemingly easy question, but it blew my mind (partly because I feel the problem is not articulated well!).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g_pm0Hru--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z1at8e0m7pyr8kosn4ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g_pm0Hru--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z1at8e0m7pyr8kosn4ag.png" alt="Hackerrank Problem"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I thought I didn't have to think much about it, because it seemed so straightforward. I mean, it is given that one child is a boy, so the probability of the other child being a boy is &lt;code&gt;1/2&lt;/code&gt; (unless otherwise stated).&lt;/p&gt;

&lt;p&gt;I selected &lt;code&gt;1/2&lt;/code&gt; from the answer choices, and it was marked wrong. Okay, it needed a little more thinking. I wrote down all the possibilities where at least one of the children is a boy (we are already given this information).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BB, BG, GB]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Therefore, another answer that is equally defensible given the question is &lt;code&gt;1/3&lt;/code&gt;. And as it turns out, this is indeed the accepted answer. &lt;/p&gt;

&lt;p&gt;But how can we have two correct answers to this? In my view, both answers are correct as per the problem statement; it depends on how the statement is interpreted. &lt;/p&gt;
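&lt;p&gt;A short script can enumerate both readings of the question; this is my own sketch, not part of the original problem:&lt;/p&gt;

```python
from fractions import Fraction
from itertools import product

# All equally likely two-child families: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

# Reading 1: "at least one child is a boy" keeps BB, BG, GB.
at_least_one_boy = [f for f in families if "B" in f]
p_both = Fraction(at_least_one_boy.count(("B", "B")), len(at_least_one_boy))
print(p_both)  # 1/3

# Reading 2: "a specific child (say the first) is a boy" keeps BB, BG.
first_is_boy = [f for f in families if f[0] == "B"]
p_both_given_first = Fraction(first_is_boy.count(("B", "B")), len(first_is_boy))
print(p_both_given_first)  # 1/2
```

&lt;p&gt;The two probabilities differ only because the two readings condition on different events, which is exactly the ambiguity the paradox hinges on.&lt;/p&gt;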

&lt;p&gt;I took to Google for adjudication, and again, Google didn't disappoint. Look what I found:&lt;/p&gt;


&lt;div class="ltag__wikipedia--container"&gt;
  &lt;div class="ltag__wikipedia--header"&gt;
    &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sew3uq9H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/wikipedia-logo-0a3e76624c7b1c3ccdeb9493ea4add6ef5bd82d7e88d102d5ddfd7c981efa2e7.svg" class="ltag__wikipedia--logo" alt="Wikipedia Logo"&gt;
    &lt;a href="https://en.wikipedia.org/wiki/Boy_or_Girl_paradox" rel="noopener noreferrer"&gt;Boy or Girl paradox&lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="ltag__wikipedia--extract"&gt;
&lt;p&gt;The &lt;b&gt;Boy or Girl paradox&lt;/b&gt; surrounds a set of questions in probability theory, which are also known as &lt;b&gt;The Two Child Problem&lt;/b&gt;, &lt;b&gt;Mr. Smith's Children&lt;/b&gt; and the &lt;b&gt;Mrs. Smith Problem&lt;/b&gt;. The initial formulation of the question dates back to at least 1959, when Martin Gardner featured it in his October 1959 "Mathematical Games column" in &lt;i&gt;Scientific American&lt;/i&gt;. He titled it &lt;b&gt;The Two Children Problem&lt;/b&gt;, and phrased the paradox as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls?&lt;/li&gt;
&lt;li&gt;Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys?&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;div class="ltag__wikipedia--btn--container"&gt;
    
      &lt;a href="https://en.wikipedia.org/wiki/Boy_or_Girl_paradox" rel="noopener noreferrer"&gt;View on Wikipedia&lt;/a&gt;
    
  &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;I think it is a good read. I suggest you go through it (if you haven't already). &lt;/p&gt;




&lt;p&gt;I just wanted to share this information with this erudite community. Please feel free to reach out to me on:&lt;/p&gt;

&lt;p&gt;Twitter: &lt;a href="https://twitter.com/AashishLChaubey"&gt;https://twitter.com/AashishLChaubey&lt;/a&gt;&lt;br&gt;
LinkedIn: &lt;a href="https://www.linkedin.com/in/chaubey-aashish"&gt;https://www.linkedin.com/in/chaubey-aashish&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please checkout my &lt;a href="https://github.com/aashish-chaubey"&gt;Github&lt;/a&gt;&lt;/p&gt;

</description>
      <category>probability</category>
      <category>maths</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Dimensionality Reduction: Generalized Discriminant Analysis</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Thu, 18 Feb 2021 19:04:17 +0000</pubDate>
      <link>https://dev.to/aashish/dimensionality-reduction-generalized-discriminant-analysis-4l5m</link>
      <guid>https://dev.to/aashish/dimensionality-reduction-generalized-discriminant-analysis-4l5m</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome to yet another exciting series of blogs. In this series, we are going to talk about dimensionality reduction techniques. Our thinking throughout this series should be oriented towards optimization, and every technique discussed here will keep that in mind. This section will also involve a bit of math, but, to be inclusive, we shall keep it simple and straightforward, discussing only the concepts and how to perceive them.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/aashish/dimensionality-reduction-linear-discriminant-analysis-1ca4"&gt;last article&lt;/a&gt; we talked about the Linear Discriminant Analysis model and saw how it projects data from a high-dimensional space to a low-dimensional one. We saw it in action and also saw the use of LDA as a classifier. In this article, we are going to talk about Generalized Discriminant Analysis. Let's get going then...&lt;/p&gt;




&lt;h3&gt;
  
  
  What is Generalized Discriminant Analysis?
&lt;/h3&gt;

&lt;p&gt;GDA deals with nonlinear discriminant analysis using a kernel function operator. The underlying theory is close to that of support vector machines (SVMs), insofar as the GDA method provides a mapping of the input vectors into a high-dimensional feature space. Similar to LDA, the objective of GDA is to find a projection of the features into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter. The main idea is to map the input space into a convenient feature space in which the variables are nonlinearly related to the input space. &lt;/p&gt;
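&lt;p&gt;scikit-learn does not ship a GDA class, but the idea (a kernel map into a feature space, followed by linear discriminant analysis) can be sketched by chaining &lt;code&gt;KernelPCA&lt;/code&gt; with LDA. The kernel and its parameters below are illustrative choices of mine, not the canonical GDA algorithm:&lt;/p&gt;

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Two classes that no straight line can separate.
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Plain (linear) LDA struggles on this data.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("linear LDA accuracy:", round(lda.score(X, y), 2))

# An RBF kernel map followed by LDA, in the spirit of GDA,
# lets a linear discriminant work on nonlinearly separable classes.
kernel_lda = make_pipeline(
    KernelPCA(n_components=10, kernel="rbf", gamma=15),
    LinearDiscriminantAnalysis(),
)
kernel_lda.fit(X, y)
print("kernelized accuracy:", round(kernel_lda.score(X, y), 2))
```

&lt;p&gt;The design choice mirrors the paragraph above: the kernel supplies the nonlinear mapping, and the familiar scatter-ratio criterion of LDA does the rest.&lt;/p&gt;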




&lt;p&gt;I hope this was helpful and was able to put things down in a simple way. Please feel free to reach to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

&lt;p&gt;Until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Dimensionality Reduction: An Introduction</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Thu, 18 Feb 2021 18:25:07 +0000</pubDate>
      <link>https://dev.to/aashish/dimensionality-reduction-an-introduction-5h1o</link>
      <guid>https://dev.to/aashish/dimensionality-reduction-an-introduction-5h1o</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---UINB_Ob--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/efrngmir0sdfotduotmh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---UINB_Ob--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/efrngmir0sdfotduotmh.jpeg" alt="Space"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Every day, three times per second, &lt;a href="https://www.forbes.com/sites/strategyand/2013/11/05/what-is-big-data-anyway/%22%20%5Cl%20%223e15d55e1c76%22%20%5Ct%20%22_blank"&gt;we produce the equivalent&lt;/a&gt; of the amount of data that the Library of Congress has in its entire print collection, right? But most of it is like cat videos on YouTube or 13-year-olds exchanging text messages about the next Twilight movie."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;– &lt;em&gt;&lt;strong&gt;&lt;a href="https://fivethirtyeight.com/contributors/nate-silver/%22%20%5Ct%20%22_blank"&gt;Nate Silver&lt;/a&gt;&lt;/strong&gt;, founder and editor in chief of FiveThirtyEight&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So, what do we understand by this? Data, in contemporary times, is produced on a huge scale, and the only thing limiting its growth is storage. The sources of data have evolved, and so have the means, thanks to IoT and other technologies. This data is beautiful: we can do all kinds of operations on it and fetch insights from it. But the fundamental questions are: Is all of this data useful? Do I need to process all of it? Can I process all of it? And most importantly, how do I process all of it?&lt;/p&gt;




&lt;h3&gt;
  
  
  Let’s understand this with an example – Email Classification.
&lt;/h3&gt;

&lt;p&gt;We regularly receive a lot of emails, all of varying importance. The least important ones are spam emails; in fact, they are unwanted and undesirable, and a lot of effort goes into preventing them from landing in our inbox. Let’s pick one example from our inbox and contemplate it for a moment. Each email, apart from the sender and recipient addresses, contains the subject (considered an important factor in deciding whether you will open the mail) and the body, along with any attached documents. Look at the body of this mail; look at each of its words carefully! What is it that makes us decide whether it is spam or not? Some special words in a spam mail create a sense of urgency, and a few of these in a particular sequence make us decide that it is spam. So, let’s break the body down into a list (vector) of words and then understand the pattern in it and the usage of “the special words” in this email. &lt;/p&gt;

&lt;p&gt;To achieve this goal, you construct a mathematical representation of each email as a bag-of-words vector. This is a count vector in which each position corresponds to a specific word from a fixed vocabulary. For a given email, each entry of the bag-of-words vector is the number of times the corresponding word appears in that email (0 if it does not appear at all). Assume you have constructed a bag-of-words vector from each email; as a result, you have a sample of bag-of-words vectors&lt;/p&gt;

&lt;p&gt;&lt;em&gt;x&lt;sub&gt;1&lt;/sub&gt;, x&lt;sub&gt;2&lt;/sub&gt;, ⋯, x&lt;sub&gt;m&lt;/sub&gt;&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;Hang on for a while and think again. Are all of the words in the email important for the identification of these special words? And what else is needed here?&lt;/p&gt;

&lt;p&gt;However, not all words (dimensions) of your vectors are informative for the spam/not-spam classification. For instance, the words “lottery”, “credit”, and “pay” would be better variables/features for spam classification than “dog”, “cat”, or “tree”. Simply dropping uninformative word columns already removes a lot of dimensions from this data, but we can also apply more specialized dimensionality reduction techniques, which we shall cover in the subsequent sections.&lt;/p&gt;
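&lt;p&gt;A bag-of-words vector is easy to build by hand; here is a tiny sketch with made-up emails and Python's &lt;code&gt;collections.Counter&lt;/code&gt;:&lt;/p&gt;

```python
from collections import Counter

emails = ["you won the lottery pay now to claim the lottery",
          "the dog chased the cat up the tree"]

# Fixed vocabulary: one position per word, shared by every email.
vocab = sorted(set(" ".join(emails).split()))

def bag_of_words(text):
    """Turn an email body into a vector of word counts over the vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(e) for e in emails]
print(vocab)
print(vectors)
```

&lt;p&gt;Each position of a vector is a dimension, so even this toy vocabulary already gives more than a dozen dimensions; real inboxes give tens of thousands.&lt;/p&gt;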




&lt;h3&gt;
  
  
  We’ve spoken a lot about dimensionality reduction BUT what is “Dimensionality Reduction”?
&lt;/h3&gt;

&lt;p&gt;Dimensionality reduction is, simply, the process of reducing the number of dimensions of your feature set. Your feature set could be a dataset with a hundred columns (i.e., features), or it could be an array of points that make up a large sphere in three-dimensional space. Dimensionality reduction means bringing the number of columns down to, say, twenty, or converting the sphere to a circle in two-dimensional space.&lt;/p&gt;




&lt;h3&gt;
  
  
  What problems do they pose? – The Curse of Dimensionality
&lt;/h3&gt;

&lt;p&gt;Data in higher dimensions generally poses many more problems than data in lower dimensions; this is what is called the curse of dimensionality. Think of an image recognition problem with high-resolution images: 1280 × 720 = 921,600 pixels, i.e., 921,600 dimensions. That is huge! Now you see why the “CURSE”?&lt;/p&gt;

&lt;p&gt;As the number of features increases, the number of samples required increases as well. The more features we have, the greater the number of samples we will need to have all combinations of feature values well represented in our sample.&lt;/p&gt;

&lt;p&gt;As the number of features increases, the model becomes more complex, and the more features there are, the greater the chance of overfitting. A machine learning model trained on a large number of features becomes increasingly dependent on the data it was trained on and, in turn, overfitted, resulting in poor performance on real data and defeating the purpose.&lt;/p&gt;
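&lt;p&gt;One way to see the curse numerically is distance concentration: in high dimensions, the nearest and farthest random points end up almost equally far away, which hurts any method that relies on distances. A quick NumPy illustration (the sample sizes and dimensions here are arbitrary):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=500):
    """Relative gap between the nearest and farthest random point from the origin."""
    points = rng.random((n, dim))
    d = np.linalg.norm(points, axis=1)
    return (d.max() - d.min()) / d.min()

# The spread collapses as the dimension grows: distances concentrate.
for dim in (2, 100, 10_000):
    print(dim, round(distance_spread(dim), 3))
```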

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eodgCjm_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1d39orhbcu1cgz66l1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eodgCjm_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1d39orhbcu1cgz66l1q.png" alt="Graphic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Avoiding overfitting is a major motivation for performing dimensionality reduction. The fewer features our training data has, the fewer assumptions our model makes, and the simpler it will be. But that is not all; dimensionality reduction has many more advantages to offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less misleading data means model accuracy improves.&lt;/li&gt;
&lt;li&gt;Fewer dimensions mean less computing, and less data means algorithms train faster.&lt;/li&gt;
&lt;li&gt;Less data means less storage space is required.&lt;/li&gt;
&lt;li&gt;Fewer dimensions allow the use of algorithms unfit for a large number of dimensions.&lt;/li&gt;
&lt;li&gt;It removes redundant features and noise.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Methods: Feature Selection and Feature Engineering
&lt;/h3&gt;

&lt;p&gt;Dimensionality reduction can be done by both feature selection and feature engineering methods.&lt;/p&gt;

&lt;p&gt;Feature selection is the process of identifying and selecting relevant features for your sample. Feature engineering is manually generating new features from existing features, by applying some transformation or performing some operation on them.&lt;/p&gt;

&lt;p&gt;Feature selection can be done either manually or programmatically. For example, suppose you are trying to build a model that predicts people’s weights, and you have collected a large corpus of data describing each person quite thoroughly. If you had a column that described the color of each person’s clothing, would that help much in predicting their weight? I think we can safely agree it won’t; this is something we can drop without further ado. What about a column that described their heights? That’s a definite yes. We can make these simple manual feature selections, and reduce the dimensionality, when the relevance or irrelevance of certain features is obvious or common knowledge. And when it’s not glaringly obvious, there are a lot of tools we can employ to aid our feature selection.&lt;/p&gt;
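&lt;p&gt;The weight example above can also be done programmatically. A hedged sketch on synthetic data (the features, coefficients, and threshold are all invented for illustration), ranking features by their absolute correlation with the target:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy data: weight depends on height, not on clothing colour.
height = rng.normal(170, 10, n)
colour = rng.integers(0, 5, n).astype(float)   # irrelevant feature
weight = 0.9 * height + rng.normal(0, 3, n)

features = {"height": height, "colour": colour}

# Programmatic feature selection: rank features by |correlation| with the target.
scores = {name: abs(np.corrcoef(col, weight)[0, 1])
          for name, col in features.items()}
best = max(scores, key=scores.get)
print(scores)
print("keep:", best)  # height wins by a wide margin
```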




&lt;p&gt;I hope this was helpful and was able to put things down in a simple way. Please feel free to reach to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Dimensionality Reduction: Linear Discriminant Analysis</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Thu, 18 Feb 2021 18:01:38 +0000</pubDate>
      <link>https://dev.to/aashish/dimensionality-reduction-linear-discriminant-analysis-1ca4</link>
      <guid>https://dev.to/aashish/dimensionality-reduction-linear-discriminant-analysis-1ca4</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome to yet another exciting series of blogs. In this series, we are going to talk about dimensionality reduction techniques. Our thinking throughout this series should be oriented towards optimization, and every technique discussed here will keep that in mind. This section will also involve a bit of math, but, to be inclusive, we shall keep it simple and straightforward, discussing only the concepts and how to perceive them.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/aashish/dimensionality-reduction-principal-component-analysis-1772"&gt;last article&lt;/a&gt; we discussed Principal Component Analysis. Without delving into the mathematics behind it, we understood how it works through intuition and a small example at each step. In this article, we shall discuss yet another important dimensionality reduction technique that is ubiquitous in machine learning projects. We shall see why we need it, where we use it, and more. As always, let's start the discussion by asking the fundamental question of this blog.&lt;/p&gt;

&lt;p&gt;I suggest &lt;a href="https://www.knowledgehut.com/blog/data-science/linear-discriminant-analysis-for-machine-learning"&gt;this article&lt;/a&gt; as a must-read on this topic. But still, to simplify things a little more, here we go...&lt;/p&gt;




&lt;h3&gt;
  
  
  What is Linear Discriminant Analysis?
&lt;/h3&gt;

&lt;p&gt;If you happen to read about it elsewhere, or just hit Google for it, you may receive a completely unexpected answer. What I learned on my first expedition into LDA was that it is a &lt;code&gt;Linear Classifier&lt;/code&gt;. You must be acquainted with what a classifier is by now, and as the name suggests, this one is linear; it is often used as an alternative to Logistic Regression. But a few moments later, from another source, of course, I learned that it is a dimensionality reduction technique. That was confusing! But as &lt;a href="https://towardsdatascience.com/is-lda-a-dimensionality-reduction-technique-or-a-classifier-algorithm-eeed4de9953a"&gt;this article&lt;/a&gt; puts it together beautifully, it can actually be used as both. The author clearly states the reasons for using it either way and the conditions under which you can do so. This is what he says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;the characteristics of the dataset that you are working on will guide you about the decision to apply LDA as a classifier or a dimensionality reduction algorithm to perform a classification task.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The main task of Linear Discriminant Analysis is to separate the examples of the classes linearly by moving them into a different feature space; therefore, if your dataset is linearly separable, applying LDA as a classifier alone will give you great results. But if that isn't the case, Linear Discriminant Analysis (LDA) can act as an extra tool applied over the dataset to try to “make things better” or to facilitate the job of the posterior classifier.&lt;/p&gt;

&lt;p&gt;In some cases, it could be used as a reinforcement of PCA by applying linear dimensionality reduction to the reduced set of variables from PCA. This could give a much better dataset to apply the modeling on. &lt;/p&gt;




&lt;h3&gt;
  
  
  What happens in the Linear Discriminant Analysis technique?
&lt;/h3&gt;

&lt;p&gt;The theory is pretty simple. It finds the linear combination of the predictors such that the variance between the groups is maximized while, at the same time, the variation within each group is minimized.&lt;/p&gt;
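&lt;p&gt;That scatter ratio can be computed from scratch in a few lines of NumPy. This is a simplified two-class sketch on synthetic Gaussian data, not a production implementation:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in 2-D.
X0 = rng.normal([0, 0], 1.0, size=(100, 2))
X1 = rng.normal([4, 2], 1.0, size=(100, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: sum of the per-class covariances.
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)

# Fisher direction: maximises between-class over within-class variance.
w = np.linalg.solve(Sw, m1 - m0)
w = w / np.linalg.norm(w)

# Project both classes onto the single discriminant axis (2-D down to 1-D).
z0, z1 = X0 @ w, X1 @ w
print(round(z1.mean() - z0.mean(), 2))  # large gap: classes well separated
```

&lt;p&gt;Projecting onto &lt;code&gt;w&lt;/code&gt; is exactly the dimensionality-reduction use of LDA; thresholding the projected value turns the same machinery into a classifier.&lt;/p&gt;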

&lt;p&gt;This video should help in understanding more intricate details and the working of the model.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/azXCzI57Yfc"&gt;
&lt;/iframe&gt;
 &lt;/p&gt;




&lt;p&gt;I hope this was helpful and was able to put things down in a simple way. Please feel free to reach to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

&lt;p&gt;Until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Dimensionality Reduction: Principal Component Analysis</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Thu, 18 Feb 2021 14:53:21 +0000</pubDate>
      <link>https://dev.to/aashish/dimensionality-reduction-principal-component-analysis-1772</link>
      <guid>https://dev.to/aashish/dimensionality-reduction-principal-component-analysis-1772</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome to yet another exciting series of blogs. In this series, we are going to talk about dimensionality reduction techniques. Our thinking throughout this series should be oriented towards optimization, and every technique discussed here will keep that in mind. This section will also involve a bit of math, but, to be inclusive, we shall keep it simple and straightforward, discussing only the concepts and how to perceive them.&lt;/p&gt;

&lt;p&gt;Well, just to recapitulate from the last article: a dimension is nothing but an input variable or feature of a dataset. Suppose you build an employee dataset and incorporate as many as 15 features of each employee; the dataset could then be said to have 15 dimensions. And, as we have discussed, processing a large number of dimensions is not always favorable with respect to space and time; it can dramatically impact the performance of machine learning algorithms, which is generally referred to as the "&lt;a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" rel="noopener noreferrer"&gt;Curse of Dimensionality&lt;/a&gt;". "Large" is relative: it depends on the data, the computational resources you have, the deployment environment of the model, and more. Thus, for large datasets, we turn to dimensionality reduction, i.e., a reduction in the number of features or variables, where necessary for optimal usage.  &lt;/p&gt;

&lt;p&gt;Now, there is no free lunch, at least not in computing! Dimensionality reduction doesn't come without tradeoffs: there is a good chance you will lose some of the information in the data. There is no absolute measure of how much; it depends on the data and the implementation. So if you are fine with trading some loss of information for lower computational complexity, dimensionality reduction can help you out in many situations.&lt;/p&gt;

&lt;p&gt;In this tutorial, we are going to start with one such technique, known as Principal Component Analysis. We shall look at its definition, the steps involved, and its benefits. So without any further ado, let's get the ball rolling...&lt;/p&gt;




&lt;h3&gt;
  
  
  What is Principal Component Analysis?
&lt;/h3&gt;

&lt;p&gt;Principal Component Analysis (PCA) is a technique wherein we first build our own set of dimensions from the original dataset, do some analysis on it, and then translate our dataset to those new dimensions without losing much information; how much to keep is ultimately your choice. It enables us to create and use a reduced set of variables, called &lt;code&gt;principal components&lt;/code&gt;. A reduced set is much easier to analyze and interpret. Studying a data set that requires estimating roughly 500 parameters may be difficult, but if we could reduce those to 5 it would certainly make our day.&lt;/p&gt;

&lt;p&gt;There are many tutorials on the internet, as this is one of the most common techniques for reducing complexity. I won't suggest reading all of them, as they tend either to get too involved or to get too banal as you go along. The Wikipedia article, however, is a good read.&lt;/p&gt;


&lt;div class="ltag__wikipedia--container"&gt;
  &lt;div class="ltag__wikipedia--header"&gt;
    &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fwikipedia-logo-0a3e76624c7b1c3ccdeb9493ea4add6ef5bd82d7e88d102d5ddfd7c981efa2e7.svg" class="ltag__wikipedia--logo" alt="Wikipedia Logo"&gt;
    &lt;a href="https://en.wikipedia.org/wiki/Principal_component_analysis" rel="noopener noreferrer"&gt;Principal component analysis&lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="ltag__wikipedia--extract"&gt;&lt;p&gt;&lt;b&gt;Principal component analysis&lt;/b&gt; (&lt;b&gt;PCA&lt;/b&gt;) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.&lt;/p&gt;&lt;/div&gt;
  &lt;div class="ltag__wikipedia--btn--container"&gt;
      &lt;a class="ltag__wikipedia--btn" href="https://en.wikipedia.org/wiki/Principal_component_analysis" rel="noopener noreferrer"&gt;View on Wikipedia&lt;/a&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;h3&gt;
  
  
  What are the steps involved in PCA?
&lt;/h3&gt;

&lt;p&gt;Here is an outline of the important steps involved in implementing PCA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardization&lt;/li&gt;
&lt;li&gt;Computing Covariance Matrix&lt;/li&gt;
&lt;li&gt;Computing the eigenvectors and eigenvalues of the covariance matrix to identify the principal components&lt;/li&gt;
&lt;li&gt;Extracting the Feature Vector&lt;/li&gt;
&lt;li&gt;Recasting the data along the principal component axes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let us briefly go over each of these steps to get a clearer understanding of the topic without getting mathematical.&lt;/p&gt;

&lt;h4&gt;
  
  
  Standardization
&lt;/h4&gt;

&lt;p&gt;When you have a large data set, there are many columns and a large number of records. Each variable fluctuates within a particular range. For example, the speed of a car cannot be negative and cannot cross a particular value either; it has to lie within some limit, e.g., 0-200 km/h. But the temperature of the engine fluctuates in another range, e.g., 0-50 °C. Similarly, there could be many other variables. In order to judge them with respect to each other, we must bring them down to a common scale. For example, subtract from each value the mean of that feature and divide by the standard deviation (a measure of the dispersion of a dataset relative to its mean). &lt;/p&gt;

&lt;p&gt;Mathematically speaking,&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;z=value−meanstandard deviation
 z = \frac{value - mean}{standard\ deviation}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ia&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;m&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Once the standardization is done, all the variables will be transformed to the same scale.&lt;/p&gt;
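&lt;p&gt;As a quick illustration, here is a minimal sketch of this standardization (the z-score) using NumPy; the speed and temperature values below are invented for the example:&lt;/p&gt;

```python
# Minimal z-score standardization sketch with NumPy (illustrative data only)
import numpy as np

# Hypothetical records: [speed in km/h, engine temperature in deg C]
X = np.array([[120.0, 35.0],
              [ 80.0, 20.0],
              [160.0, 45.0],
              [100.0, 30.0]])

# z = (value - mean) / standard deviation, computed per feature (column)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # and standard deviation 1
```

&lt;p&gt;After this step both features live on the same scale, so neither dominates the computations that follow.&lt;/p&gt;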

&lt;h4&gt;
  
  
  Computing Covariance Matrix
&lt;/h4&gt;

&lt;p&gt;Have you ever had the experience of someone asking your birth year and then immediately asking your age? I guess so. Do both these questions provide equally significant information, or is one redundant in context? This is the kind of redundancy the covariance matrix aims to expose within the data. It captures how closely the variables are related to each other and establishes a relation between them. For p-dimensional data, you will have a &lt;code&gt;p*p&lt;/code&gt; covariance matrix. Also, the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.&lt;/p&gt;
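&lt;p&gt;A small NumPy sketch, using random data purely for illustration, shows the shape and symmetry described above:&lt;/p&gt;

```python
# Covariance matrix of standardized data with NumPy (random illustrative data)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 records, p = 3 features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# rowvar=False treats each column as a variable, giving a p x p matrix
cov = np.cov(X_std, rowvar=False)

print(cov.shape)                # (3, 3)
print(np.allclose(cov, cov.T))  # True: symmetric about the main diagonal
```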

&lt;h4&gt;
  
  
  Computing the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
&lt;/h4&gt;

&lt;p&gt;I know this is where things get mathematical, but we will try to get around that by just understanding the concept rather than working out the math behind it. &lt;/p&gt;

&lt;p&gt;The first thing to understand here is that we decompose the matrix. Why do we do that, you ask? We do it to break the matrix into its constituent elements, which tell us more about what is significant in it. There are many ways to do this, but perhaps the most popular is the eigendecomposition of the covariance matrix, which yields eigenvectors and eigenvalues. &lt;/p&gt;

&lt;p&gt;Now that we have the constituent vectors along with their scaling values, we can combine the original variables linearly to squeeze the maximum information out of the given data and form the principal components. These combinations are done in such a way that the new variables (i.e., the principal components) are uncorrelated and most of the information within the initial variables is compressed into the first components. So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on. &lt;/p&gt;

&lt;p&gt;Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and keeping the remaining components as your new variables. To put it simply, think of principal components as new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more visible.&lt;/p&gt;
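&lt;p&gt;The eigendecomposition and the ordering of components by explained variance can be sketched with NumPy as follows (random data, purely for illustration):&lt;/p&gt;

```python
# Eigendecomposition of the covariance matrix with NumPy (illustrative data)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Re-order so the first principal component carries the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of total variance explained by each principal component
explained = eigenvalues / eigenvalues.sum()
print(explained)
```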

&lt;p&gt;Look at this image; you will come across it often when you look up PCA on the internet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdoreh6v5hnm71hv6iqm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdoreh6v5hnm71hv6iqm.gif" alt="PCA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set. For example, assuming the scatter plot of our data set is as shown above, can we guess the first principal component? Yes, it's approximately the line that matches the purple marks, because it goes through the origin and it's the line on which the projection of the points (red dots) is the most spread out. Mathematically speaking, it's the line that maximizes the variance (the average of the squared distances from the projected points to the origin). We keep doing this until we have covered all the principal components in the data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Extracting Feature Vectors
&lt;/h4&gt;

&lt;p&gt;Thanks to the eigenvalues, we now have all the principal components arranged in descending order of importance, and we can select as many as we like to reduce the complexity of the data. In this step, we choose whether to keep all of the components or to discard those of lesser significance (with low eigenvalues), and form from the remaining ones a matrix of vectors that we call the feature vector.&lt;/p&gt;
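&lt;p&gt;Continuing the NumPy sketch, keeping the top k eigenvectors as columns gives the feature vector:&lt;/p&gt;

```python
# Forming the feature vector: keep the top-k eigenvectors (illustrative data)
import numpy as np

rng = np.random.default_rng(2)
X_std = rng.normal(size=(50, 5))          # pretend this is standardized data
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Indices of components sorted by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]

# Discard the less significant components and keep k = 2 of them
k = 2
feature_vector = eigenvectors[:, order[:k]]
print(feature_vector.shape)  # (5, 2): a p x k matrix
```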

&lt;h4&gt;
  
  
  Recast data along the Principal Component axes
&lt;/h4&gt;

&lt;p&gt;Until now we were only analyzing the data; we did not actually perform any transformation on it. We now use the feature vector we have deduced to reorient the data along the principal component axes. This can be done by multiplying the transpose of the feature vector by the transpose of the standardized data set. Here is a more mathematical version of it:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Resultant  Data  =  FeatureVectorT  ∗  Standardized  DatasetT
Resultant\;Data\;=\;FeatureVector^T\;*\;Standardized\;Dataset^T
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;es&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;lt&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;D&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;F&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∗&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;St&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ze&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;D&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;se&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This is how the data is transformed and the complexity in the data is reduced using Principal Component Analysis.&lt;/p&gt;
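&lt;p&gt;Putting the steps together, the recast from the formula above can be sketched end to end in NumPy (random illustrative data; in practice a library such as scikit-learn's PCA does all of this for you):&lt;/p&gt;

```python
# End-to-end PCA sketch in NumPy: standardize, decompose, select, recast
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))

# 1. Standardization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix and 3. eigendecomposition
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Feature vector: top 2 components by eigenvalue
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:2]]

# 5. Resultant Data = FeatureVector^T * StandardizedDataset^T
result = feature_vector.T @ X_std.T
print(result.T.shape)  # 100 samples, now in 2 dimensions
```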




&lt;p&gt;I hope this was helpful and was able to put things down in a simple way. Please feel free to reach out to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey" rel="noopener noreferrer"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

&lt;p&gt;Until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Text Analytics: Text Classification</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Thu, 18 Feb 2021 05:51:25 +0000</pubDate>
      <link>https://dev.to/aashish/text-analytics-text-classification-2b72</link>
      <guid>https://dev.to/aashish/text-analytics-text-classification-2b72</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome back to yet another exciting series of narratives in our quest to understand the fundamentals of Text Analytics. In the &lt;a href="https://dev.to/aashish/text-analytics-topic-modeling-1b92"&gt;last article&lt;/a&gt; we saw what topic modeling is and, in a simple, superficial way, the different approaches to it, and why it matters. If you've noticed, over the last few articles we've started putting the smaller pieces together to see the big picture of what these NLP applications are. You might not realize it, but you are on course to write your first sophisticated algorithm. From the point of view of this tutorial series you might not see that happen in terms of code, but the knowledge will surely ease the way; you just need to pick up a language and get yourself up to speed. &lt;/p&gt;

&lt;p&gt;In this article, we are going to talk about yet another interesting and important topic in NLP: Text Classification. We will see what it is and, through examples, the different ways to do it, completing the pipeline along the way. So let's get ourselves going...&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht2w1jn6t48jyanz7kz6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht2w1jn6t48jyanz7kz6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Text Classification?
&lt;/h3&gt;

&lt;p&gt;The simplest definition of text classification is that it is the classification of text based on its content. It can be used to organize, structure, and categorize pretty much any kind of text – from documents, medical studies, and files to content all over the web. For example, news articles can be organized by topic; support tickets by urgency; chat conversations by language; brand mentions by sentiment; and so on.&lt;/p&gt;

&lt;p&gt;Consider the following example:&lt;/p&gt;

&lt;p&gt;“The movie was boring and slow!”&lt;/p&gt;

&lt;p&gt;A classifier can take this text as an input, analyze its content, and then automatically assign relevant tags, such as &lt;code&gt;boring&lt;/code&gt; and &lt;code&gt;slow&lt;/code&gt; that represent this text.&lt;/p&gt;




&lt;p&gt;Now, to understand the process of text classification, let's take a real-world problem: spam. Every day you receive many emails related to the different activities you are involved in, and you must also get some spam; that's not new for any of us. Have you ever meticulously analyzed a spam mail? Tried to figure out whether there is a general structure or pattern to it? I am sure many of us have tried this activity at least once. Okay, maybe you haven't; I'm the procrastinator who used to do such things. &lt;/p&gt;

&lt;p&gt;So you see, there are two classes you can sort your mail into based on its usability: &lt;code&gt;spam&lt;/code&gt; and &lt;code&gt;not spam&lt;/code&gt;. Let's see an example of each of these classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Spam&lt;/em&gt;: "Dear Customer, we are excited to inform you that your account is eligible for a $1000 reward. To avail click the link below now!!!"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Not Spam&lt;/em&gt;: "Dear customer, this is to inform you that our services will be temporarily restricted between 12:00 AM to 4:00 AM for maintenance purposes. We request you to please avoid using the services."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have the data, and a significant number of records, you can start with the tasks typical of an NLP application that we covered in the earlier articles, like tokenization, lemmatization, and stop-word removal. You could use any language of your choice, and depending on it you would have an output more or less like the one below: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Spam&lt;/em&gt;: "dear customer excited inform account eligible 1000 reward avail click link"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Not Spam&lt;/em&gt;: "dear customer inform service temporarily restricted 12:00 4:00 maintenance purpose request please avoid using service"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may clean your data down to this kind of representation, something even more advanced than this, or a bit more primitive, based entirely on the data you have and what we finally expect from it. In any case, we must now remember that machine learning algorithms work with numerical data only; in fact, the computer is designed to work with numbers. So we have to represent our processed textual data numerically somehow. &lt;/p&gt;

&lt;p&gt;The mapping from textual data to real-valued vectors is called feature extraction. There are many ways to do it; one of the most common is the &lt;code&gt;Bag of Words (BoW)&lt;/code&gt; model. It is a representation of text that describes the occurrence of words within a document. It involves two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A vocabulary of known words.&lt;/li&gt;
&lt;li&gt;A measure of the presence of known words.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur. The intuition is that documents are similar if they have similar content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Jack be nimble
Jack be quick
Jack jump over 
The candlestick
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This snippet consists of 4 lines in all. For this example, let us consider each line to be a separate document, giving us a collection of documents, each with a few words in it. &lt;/p&gt;

&lt;p&gt;Now let us build the vocabulary: &lt;/p&gt;

&lt;p&gt;In this collection of documents, we have 8 unique words out of 11 total words. &lt;/p&gt;

&lt;p&gt;Remember, the objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.&lt;/p&gt;

&lt;p&gt;One simple way is to mark the presence of words as boolean values, 1 if present, 0 otherwise.&lt;/p&gt;

&lt;p&gt;In that way, our first document will look something like this.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1, 1, 1, 0, 0, 0, 0, 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Consider this array as a vocabulary index boolean array. Similarly, you would do it for the rest of the documents. It would look something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Jack be quick" = [1, 1, 0, 1, 0, 0, 0, 0]
"Jack jump over" = [1, 0, 0, 0, 1, 1, 0, 0]
"The candlestick" = [0, 0, 0, 0, 0, 0, 1, 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
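&lt;p&gt;The whole construction above can be reproduced in a few lines of plain Python; this sketch builds the vocabulary and the boolean vectors exactly as described:&lt;/p&gt;

```python
# Plain-Python Bag-of-Words sketch for the nursery-rhyme documents above
docs = ["Jack be nimble", "Jack be quick", "Jack jump over", "The candlestick"]

# Build the vocabulary in order of first appearance (8 unique words)
vocab = []
for doc in docs:
    for word in doc.lower().split():
        if word not in vocab:
            vocab.append(word)

# Mark presence of each vocabulary word per document: 1 if present, 0 otherwise
vectors = []
for doc in docs:
    words = doc.lower().split()
    vectors.append([1 if w in words else 0 for w in vocab])

for doc, vec in zip(docs, vectors):
    print(doc, vec)
```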


&lt;p&gt;Good, now there is a problem with this, you see. The matrix we have so far is dominated by 0s; it is known as a &lt;code&gt;sparse matrix&lt;/code&gt;. This poses a problem for computation with respect to space and time, so we must condense it. There are again many ways to do that, and one of the most favored in NLP is using &lt;code&gt;n-grams&lt;/code&gt;. This has to be one of the easiest concepts: an n-gram is simply a sequence of N consecutive tokens. A 2-token n-gram, commonly known as a bigram, is a 2-word string. &lt;/p&gt;

&lt;p&gt;For example, consider the second document, &lt;em&gt;"Jack be quick"&lt;/em&gt;. Its bigrams are: ["Jack be", "be quick"]. &lt;/p&gt;

&lt;p&gt;You see, we now have a different set of elements to match against when producing the Bag of Words.&lt;/p&gt;
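&lt;p&gt;A bigram extractor takes only a line or two of logic; this small sketch shows the idea:&lt;/p&gt;

```python
# Extract bigrams by sliding a two-token window across the text
def bigrams(text):
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

print(bigrams("Jack be quick"))   # ['Jack be', 'be quick']
print(bigrams("Jack be nimble"))  # ['Jack be', 'be nimble']
```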

&lt;p&gt;Still, the problem of the sparse matrix persists. It can be mitigated by using Singular Value Decomposition (SVD). &lt;a href="https://www.analyticsvidhya.com/blog/2019/08/5-applications-singular-value-decomposition-svd-data-science/" rel="noopener noreferrer"&gt;This&lt;/a&gt; is a very comprehensive tutorial if you want to go into the technical details. &lt;/p&gt;



&lt;p&gt;One of the major disadvantages of using BoW is that it discards word order, thereby ignoring the context and, in turn, the meaning of words in the document. For natural language processing (NLP), maintaining the context of the words is of utmost importance. To solve this problem we use another approach called word embeddings.&lt;/p&gt;

&lt;p&gt;A word embedding is a representation of text where words that have similar meanings have a similar representation. There are various models for textual representation under this paradigm, but the most popular ones are Word2Vec, GloVe, and ELMo. &lt;/p&gt;


&lt;div class="ltag__wikipedia--container"&gt;
  &lt;div class="ltag__wikipedia--header"&gt;
    &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fwikipedia-logo-0a3e76624c7b1c3ccdeb9493ea4add6ef5bd82d7e88d102d5ddfd7c981efa2e7.svg" class="ltag__wikipedia--logo" alt="Wikipedia Logo"&gt;
    &lt;a href="https://en.wikipedia.org/wiki/Word2vec" rel="noopener noreferrer"&gt;Word2vec&lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="ltag__wikipedia--extract"&gt;&lt;p&gt;&lt;b&gt;Word2vec&lt;/b&gt; is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.&lt;/p&gt;&lt;/div&gt;
  &lt;div class="ltag__wikipedia--btn--container"&gt;
      &lt;a class="ltag__wikipedia--btn" href="https://en.wikipedia.org/wiki/Word2vec" rel="noopener noreferrer"&gt;View on Wikipedia&lt;/a&gt;
  &lt;/div&gt;
&lt;/div&gt;




&lt;div class="ltag__wikipedia--container"&gt;
  &lt;div class="ltag__wikipedia--header"&gt;
    &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fwikipedia-logo-0a3e76624c7b1c3ccdeb9493ea4add6ef5bd82d7e88d102d5ddfd7c981efa2e7.svg" class="ltag__wikipedia--logo" alt="Wikipedia Logo"&gt;
    &lt;a href="https://en.wikipedia.org/wiki/GloVe_(machine_learning)" rel="noopener noreferrer"&gt;GloVe&lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="ltag__wikipedia--extract"&gt;&lt;p&gt;&lt;b&gt;GloVe&lt;/b&gt;, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. As log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely the global matrix factorization and local context window methods.&lt;/p&gt;&lt;/div&gt;
  &lt;div class="ltag__wikipedia--btn--container"&gt;
      &lt;a class="ltag__wikipedia--btn" href="https://en.wikipedia.org/wiki/GloVe_(machine_learning)" rel="noopener noreferrer"&gt;View on Wikipedia&lt;/a&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Once you have your representation of the text ready as numbers and matrices, the next step is to choose a machine learning model for text classification. There is no hard and fast rule as to which model performs best; it depends upon many factors, such as the data, the computational resources, the time complexity, the use case, the end users, and the device on which they will run it. There are many intricate details in each of the algorithms, but usually these models are grouped into two families:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Machine Learning models&lt;/li&gt;
&lt;li&gt;Deep Learning models&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a comprehensive list for both of them:&lt;/p&gt;

&lt;h4&gt;
  
  
  Machine Learning
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Multinomial Naïve Bayes (NB)&lt;/li&gt;
&lt;li&gt;Logistic Regression (LR)&lt;/li&gt;
&lt;li&gt;Support Vector Machine (SVM)&lt;/li&gt;
&lt;li&gt;Stochastic Gradient Descent (SGD)&lt;/li&gt;
&lt;li&gt;k-Nearest-Neighbors (kNN)&lt;/li&gt;
&lt;li&gt;RandomForest (RF)&lt;/li&gt;
&lt;li&gt;Gradient Boosting (GB)&lt;/li&gt;
&lt;li&gt;XGBoost (XGB)&lt;/li&gt;
&lt;li&gt;Adaboost&lt;/li&gt;
&lt;li&gt;Catboost&lt;/li&gt;
&lt;li&gt;LightGBM&lt;/li&gt;
&lt;li&gt;ExtraTreesClassifier&lt;/li&gt;
&lt;/ul&gt;
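&lt;p&gt;To make the pipeline concrete, here is a toy sketch of the first model on the list, Multinomial Naive Bayes, trained with scikit-learn on an invented four-email spam dataset; the texts and labels are made up purely for illustration:&lt;/p&gt;

```python
# Toy spam classifier: Bag-of-Words features + Multinomial Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented, already-cleaned training texts (illustration only)
train_texts = [
    "dear customer eligible 1000 reward click link",   # spam
    "click link claim reward account now",             # spam
    "service temporarily restricted maintenance",      # not spam
    "inform service maintenance request avoid using",  # not spam
]
train_labels = ["spam", "spam", "not spam", "not spam"]

# Turn each text into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Classify an unseen message using the same vocabulary
X_new = vectorizer.transform(["click link claim your reward"])
print(model.predict(X_new))
```

&lt;p&gt;On this tiny data the "spammy" vocabulary dominates, so the new message lands in the spam class; real systems of course need far more data.&lt;/p&gt;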

&lt;h4&gt;
  
  
  Deep Learning
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Shallow Neural Network&lt;/li&gt;
&lt;li&gt;Deep neural network (and 2 variations)&lt;/li&gt;
&lt;li&gt;Recurrent Neural Network (RNN)&lt;/li&gt;
&lt;li&gt;Long Short Term Memory (LSTM)&lt;/li&gt;
&lt;li&gt;Convolutional Neural Network (CNN)&lt;/li&gt;
&lt;li&gt;Gated Recurrent Unit (GRU)&lt;/li&gt;
&lt;li&gt;CNN+LSTM&lt;/li&gt;
&lt;li&gt;CNN+GRU&lt;/li&gt;
&lt;li&gt;Bidirectional RNN&lt;/li&gt;
&lt;li&gt;Bidirectional LSTM&lt;/li&gt;
&lt;li&gt;Bidirectional GRU&lt;/li&gt;
&lt;li&gt;Recurrent Convolutional Neural Network (RCNN) (and 3 variations)&lt;/li&gt;
&lt;li&gt;Transformers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You have a number of metrics with which to judge the performance of these models on your data. Some of them are Precision, Recall, F1 Score, Confusion Matrix, ROC AUC, ROC curves, Cohen's Kappa, the True/False Positive Rate curve, and so on. &lt;/p&gt;
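&lt;p&gt;Most of these metrics are a one-liner in scikit-learn; here is a small sketch on an invented set of predictions:&lt;/p&gt;

```python
# Precision, recall, F1, and the confusion matrix on invented predictions
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
y_pred = ["spam", "not spam", "not spam", "not spam", "spam", "spam"]

print(precision_score(y_true, y_pred, pos_label="spam"))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred, pos_label="spam"))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred, pos_label="spam"))
print(confusion_matrix(y_true, y_pred, labels=["spam", "not spam"]))
```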




&lt;p&gt;You now have a complete picture and a good pipeline to get you started.&lt;/p&gt;

&lt;p&gt;I hope this was helpful and was able to put things down in a simple way. Please feel free to reach out to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey" rel="noopener noreferrer"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

&lt;p&gt;Until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Text Analytics: Topic Modeling</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Tue, 16 Feb 2021 10:10:50 +0000</pubDate>
      <link>https://dev.to/aashish/text-analytics-topic-modeling-1b92</link>
      <guid>https://dev.to/aashish/text-analytics-topic-modeling-1b92</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome back to yet another exciting series of narratives in our quest to understand the fundamentals of Text Analytics. In the &lt;a href="https://dev.to/aashish/text-analytics-text-visualization-46l0"&gt;last article&lt;/a&gt; we saw different ways of visualizing data, and under which conditions one should apply which graphic to get the required information across to the interpreter, with the help of examples. That should satisfy your appetite for now, but just so you know, one can never cover such a plethora of information in one blog. There is a world of creative visualization out there and it is ever-growing. You need to delve into it yourself to find out. Feel free to look around the web, get to know a few others, and do share them with your fellow colleagues.&lt;/p&gt;

&lt;p&gt;Anyways, in this article, we are going to see what Topic Modelling is and what the different ways of doing it are, with the help of examples. So let's get the ball rolling...&lt;/p&gt;




&lt;h3&gt;
  
  
  What is Topic Modelling?
&lt;/h3&gt;

&lt;p&gt;In text analytics, we often have collections of documents, such as blog posts or news articles, that we’d like to divide into natural groups so that we can understand them separately. Topic modeling is a method for doing this with unlabelled documents: similar to clustering on numeric data, it finds natural groups of items even when we’re not sure what we’re looking for. It is a method of unsupervised classification of documents; in other words, it doesn’t require a predefined list of tags or training data that has been previously classified by humans.&lt;/p&gt;

&lt;p&gt;For example, let's say you are a small business and you've recently launched a product in the market and want to know what the customer is saying about it. &lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1360355827731881986-479" src="https://platform.twitter.com/embed/Tweet.html?id=1360355827731881986"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this tweet, I quote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’m no paint expert but I buffed the whole hood and applied T&amp;amp;T Paint protect! Very nice product.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By identifying words and expressions such as &lt;code&gt;Very nice product&lt;/code&gt;, topic modeling can group this review with other reviews that talk about similar things. A topic classification model could also be used to determine what customers are talking about in customer reviews, open-ended survey responses, and on social media, to name just a few.&lt;/p&gt;




&lt;h3&gt;
  
  
  How does Topic Modeling Work?
&lt;/h3&gt;

&lt;p&gt;It’s simple, really. Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. By detecting patterns such as word frequency and the distance between words, a topic model clusters feedback that is similar, along with the words and expressions that appear most often. With this information, you can quickly deduce what each set of texts is talking about. Remember, this approach is ‘unsupervised’, meaning that no labelled training data is required. &lt;/p&gt;
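&lt;p&gt;A toy sketch of that intuition in Python (the corpus is made up): simply counting words already reveals that documents about the same topic share far more vocabulary than documents about different topics.&lt;/p&gt;

```python
from collections import Counter

# A tiny made-up corpus: two documents about pets, two about finance.
docs = [
    "the cat chased the dog around the garden",
    "my dog and my cat nap in the garden",
    "the bank raised the interest rate today",
    "markets react when the interest rate moves",
]

def word_counts(text):
    return Counter(text.lower().split())

def shared_words(a, b):
    """Crude similarity: distinct words two documents have in common."""
    return set(a).intersection(set(b))

counts = [word_counts(d) for d in docs]
print(shared_words(counts[0], counts[1]))  # pets vs pets: several shared words
print(shared_words(counts[0], counts[2]))  # pets vs finance: almost nothing
```

&lt;p&gt;Real topic models work with smarter statistics than raw overlap, but the raw counts are always the starting point.&lt;/p&gt;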




&lt;p&gt;Let's look at some of the popular methods used for topic modeling. I attempt to put things down in a very lucid manner without any use of mathematics. &lt;/p&gt;

&lt;h4&gt;
  
  
  Latent Dirichlet Allocation (LDA)
&lt;/h4&gt;

&lt;p&gt;Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Every document is a mixture of topics.&lt;/strong&gt; We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model, we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Every topic is a mixture of words.&lt;/strong&gt; For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the political topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that are associated with each topic, while also determining the mixture of topics that describes each document. The purpose of LDA is to map each document in our corpus to a set of topics that covers a good deal of the words in the document.&lt;/p&gt;

&lt;p&gt;Everything we use today has pros and cons, and that is the case here too. Although LDA might seem easy to implement and intuitive at this point, the result it yields may not be the most accurate one. After all, it is an unsupervised approach. So this is one way of doing topic modeling.&lt;/p&gt;
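&lt;p&gt;To make the two principles concrete, here is a hand-written (not estimated) toy representation of the two mixtures in Python; a real LDA implementation would learn these proportions from the corpus:&lt;/p&gt;

```python
# Hand-crafted illustration of LDA's two mixtures (not a fitted model).
# Principle 1: every document is a mixture of topics.
doc_topics = {
    "Document 1": {"politics": 0.9, "entertainment": 0.1},
    "Document 2": {"politics": 0.3, "entertainment": 0.7},
}

# Principle 2: every topic is a mixture of words
# (note that "budget" is shared between both topics).
topic_words = {
    "politics":      {"president": 0.4, "congress": 0.3, "budget": 0.3},
    "entertainment": {"movies": 0.4, "actor": 0.3, "budget": 0.3},
}

# Each mixture is a probability distribution, so its weights sum to 1.
for mixture in list(doc_topics.values()) + list(topic_words.values()):
    assert round(sum(mixture.values()), 9) == 1.0
```

&lt;p&gt;LDA's job is precisely to estimate both sets of proportions at once from nothing but the word counts.&lt;/p&gt;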




&lt;h4&gt;
  
  
  Latent Semantic Analysis (LSA)
&lt;/h4&gt;

&lt;p&gt;LSA is yet another popularly used topic modeling technique in Text Analytics. The theory says that the semantics of words can be grasped by looking at the contexts the words appear in. In other words, words that are closer in meaning will occur in similar excerpts of text.&lt;/p&gt;

&lt;p&gt;To that end, LSA computes how frequently words occur in the documents – and in the whole corpus – and assumes that similar documents will contain approximately the same distribution of word frequencies for certain words. Syntactic information (e.g. word order) and semantic information (e.g. the multiplicity of meanings of a given word) are ignored, and each document is treated as a bag of words. &lt;/p&gt;

&lt;p&gt;A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text, and a mathematical technique called Singular Value Decomposition (SVD), basically a factorization of the matrix, is used to reduce the number of rows (words) while preserving the similarity structure among the columns (documents). Documents are then compared by taking the cosine of the angle between the vectors formed by any two columns. Values close to 1 represent very similar documents, while values close to 0 represent very dissimilar documents.&lt;/p&gt;
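&lt;p&gt;Here is a minimal sketch of that pipeline in Python, assuming NumPy is available; the tiny word-by-document count matrix is made up:&lt;/p&gt;

```python
import numpy as np

# Rows = words, columns = documents (made-up counts).
# Documents 0 and 1 talk about pets; document 2 talks about finance.
#              d0  d1  d2
A = np.array([[3,  2,  0],   # "cat"
              [2,  3,  0],   # "dog"
              [0,  0,  4],   # "bank"
              [0,  1,  3]])  # "rate"

# SVD factorizes A; keeping only k=2 singular values reduces the rows
# (words) to two latent "concepts" while preserving document similarity.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one reduced vector per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vecs[0], doc_vecs[1]))  # similar documents: close to 1
print(cosine(doc_vecs[0], doc_vecs[2]))  # dissimilar documents: close to 0
```

&lt;p&gt;On real corpora you would use a sparse matrix and a truncated SVD routine, but the idea is exactly this.&lt;/p&gt;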

&lt;p&gt;Actually, LDA and LSA are a bit similar in their underlying hypothesis. But LSA is also used in text summarization, text classification, and dimension reduction.&lt;/p&gt;

&lt;p&gt;For more information, the Wikipedia entry below is a very comprehensive reference.&lt;/p&gt;


&lt;div class="ltag__wikipedia--container"&gt;
  &lt;div class="ltag__wikipedia--header"&gt;
    &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fwikipedia-logo-0a3e76624c7b1c3ccdeb9493ea4add6ef5bd82d7e88d102d5ddfd7c981efa2e7.svg" class="ltag__wikipedia--logo" alt="Wikipedia Logo"&gt;
    &lt;a href="https://en.wikipedia.org/wiki/Latent_semantic_analysis" rel="noopener noreferrer"&gt;Latent semantic analysis&lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="ltag__wikipedia--extract"&gt;&lt;p&gt;&lt;b&gt;Latent semantic analysis&lt;/b&gt; (&lt;b&gt;LSA&lt;/b&gt;) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.&lt;/p&gt;&lt;/div&gt;
  &lt;div class="ltag__wikipedia--btn--container"&gt;
      &lt;a class="ltag__wikipedia--btn" href="https://en.wikipedia.org/wiki/Latent_semantic_analysis" rel="noopener noreferrer"&gt;View on Wikipedia&lt;/a&gt;&amp;gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;This article should have given you a fair idea of what exactly Topic Modeling is, when we use it, and the popular ways to do it. We looked at a couple of algorithms superficially, just to understand how they work, and kept code to a minimum, as this has always been a language-agnostic, framework-agnostic tutorial.&lt;/p&gt;

&lt;p&gt;In the next article, we shall see what Text Classification is and how it compares with Topic Modeling. That will be a very important segment in your journey to master NLP. &lt;/p&gt;

&lt;p&gt;I hope this was helpful and that I was able to put things down in a simple way. Please feel free to reach out to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey" rel="noopener noreferrer"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

&lt;p&gt;Until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Text Analytics: Text Visualization</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Tue, 16 Feb 2021 03:47:41 +0000</pubDate>
      <link>https://dev.to/aashish/text-analytics-text-visualization-46l0</link>
      <guid>https://dev.to/aashish/text-analytics-text-visualization-46l0</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome back to yet another exciting series of narratives in our quest to understand the fundamentals of Text Analytics. In the &lt;a href="https://dev.to/aashish/text-analytics-wrangling-text-4094"&gt;last article&lt;/a&gt; we saw what data wrangling is in the textual context. It was a comprehensive guide to understanding how to prepare data so it is ready for use and can be fed to the algorithms. As we have seen before, in most cases it is the most time-consuming step, and it requires a lot of understanding of the data.&lt;/p&gt;

&lt;p&gt;For a data scientist, that is a good thing. You get a holistic view of the data before putting it to work. But what if you had to paint your understanding in someone else's mind? That is yet another skill quintessential to a data scientist. As they say, "A picture is worth a thousand words". In this section, we are going to understand what text visualization is, what the different ways of doing it are, and how exactly we do it, among other fundamentals. So use this post to satisfy your curiosity. Let's cut to the chase then...&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mfiwz3o7qvnmlt5cq7j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mfiwz3o7qvnmlt5cq7j.jpg" title="source: Wikipedia" alt="data visualization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at the graphic above. It is so neat and so full of information, and it is just one of the random images you find when browsing the net. It has so much of a story to tell, so much experience to narrate. &lt;/p&gt;

&lt;h3&gt;
  
  
  But what is this text visualization?
&lt;/h3&gt;

&lt;p&gt;Text Visualization, also known as information graphics, is a powerful tool that informs and educates readers. In a typical Natural Language Processing project, a lot of resources are needed, a lot of data is digested, a lot of communication takes place, a lot of computing is required, and all of this takes days, if not months, to get through. But when it comes to explaining your work to people, be they your professors, your boss, your partner, or any other person for that matter, only data in visual form is trusted. One could tell long stories about the work one has put in, but without something substantial, how would one prove those claims? And it is not just about credibility: visualization has become an integral part of model building. &lt;/p&gt;

&lt;p&gt;Visualization is helpful when exploring and getting to know a dataset, and it can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to harness the real value of the data. It is commonly used in the process of Exploratory Data Analysis (EDA), but I feel it is much more than that. It gives you the power to exploit data at any stage of the application: it can be used for debugging the model, cross-validating claims, and presentations, among many other things. &lt;/p&gt;

&lt;h3&gt;
  
  
  That's good, but what are the different types of charts available, and when are they to be used?
&lt;/h3&gt;

&lt;p&gt;Let's have a look at some of the basic plots used commonly by Data Analysts and Data Scientists irrespective of numerical data or textual data. Since this is a language-agnostic tutorial, we shall not be taking the help of any language reference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line Chart&lt;/li&gt;
&lt;li&gt;Bar Chart&lt;/li&gt;
&lt;li&gt;Histogram Plot&lt;/li&gt;
&lt;li&gt;Box and Whisker Plot&lt;/li&gt;
&lt;li&gt;Scatter Plot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With knowledge of these plots, you can quickly get a qualitative understanding of most data that you come across.&lt;/p&gt;




&lt;h4&gt;
  
  
  Line Chart
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Use&lt;/em&gt;: A line plot is generally used to present observations collected at regular intervals. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scale&lt;/em&gt;: The x-axis represents the regular interval, such as time. The y-axis shows the observations, ordered by the x-axis and connected by a line.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: Line plots are useful for presenting time series data as well as any sequence data where there is an ordering between observations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq34djwnykqaqojeijuco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq34djwnykqaqojeijuco.png" title="source: Wikipedia" alt="Line chart"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Bar Chart
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Use&lt;/em&gt;: A bar chart is generally used to present relative quantities for multiple categories.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scale&lt;/em&gt;: The x-axis represents the categories and is spaced evenly. The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: Bar charts can be useful for comparing multiple point quantities or estimations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotmj2t9hmof5yhk02rul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotmj2t9hmof5yhk02rul.png" title="source: Chartio" alt="Bar chart"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Histogram Plot
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Use&lt;/em&gt;: A histogram plot is generally used to summarize the distribution of a data sample.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scale&lt;/em&gt;: The x-axis represents discrete bins or intervals for the observations. For example, observations with values between 1 and 10 may be split into five bins: the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on. The y-axis represents the frequency or count of the number of observations in the dataset that belong to each bin.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: Histograms are valuable for summarizing the distribution of data samples.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Often, a careful choice of the number of bins can help to better expose the shape of the data distribution. In most plotting libraries, the number of bins can be specified by setting a “bins” argument.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrgxxadd62n6azx7owqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrgxxadd62n6azx7owqx.png" title="source: Mathworks" alt="Histogram Plot"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Box and Whisker Plot
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Use&lt;/em&gt;: A box and whisker plot, or boxplot for short, is generally used to summarize the distribution of a data sample.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scale&lt;/em&gt;: The x-axis is used to represent the data sample, where multiple boxplots can be drawn side by side on the x-axis if desired. The y-axis represents the observation values. Lines called whiskers are drawn extending from both ends of the box, out to 1.5 x IQR, to demonstrate the expected range of sensible values in the distribution. Observations outside the whiskers might be outliers and are drawn with small circles.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: Boxplots are useful to summarize the distribution of a data sample as an alternative to the histogram. They can help to quickly get an idea of the range of common and sensible values in the box and in the whisker respectively.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: A box is drawn to summarize the middle 50% of the dataset starting at the observation at the 25th percentile and ending at the 75th percentile. This is called the interquartile range or IQR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr3ig980cl28fkbs5ctq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr3ig980cl28fkbs5ctq.png" title="source: Towards Data Science" alt="Box and Whisker Plot"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Scatter Plot
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Use&lt;/em&gt;: A scatter plot (or ‘scatterplot’) is generally used to summarize the relationship between two paired data samples.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scale&lt;/em&gt;: The x-axis represents observation values for the first sample, and the y-axis represents the observed values for the second sample. Each point on the plot represents a single observation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: Scatter plots are useful for showing the association or correlation between two variables. The correlation can be quantified, for example with a line of best fit, which can be drawn on the same chart as a line plot, making the relationship clearer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiykgwy6l06oitph0p51e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiykgwy6l06oitph0p51e.png" title="source: W3Schools" alt="Scatter Plot"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;All of these can be used at different stages of an Exploratory Data Analysis. Some examples could be a frequency plot, a bar chart of the top 10 n-grams or parts of speech, a boxplot of sentiment polarity for a particular class, and so on. These can be applied only if we either have numerical data or are at least able to represent the textual data in some numerical format. &lt;/p&gt;

&lt;p&gt;These are the general charts that, irrespective of numerical or textual data, are used very often. &lt;/p&gt;

&lt;h3&gt;
  
  
  But which ones are used particularly for the textual data?
&lt;/h3&gt;

&lt;p&gt;There are a few others used extensively for textual data. One example is the Word Cloud. Let's see it with an example.&lt;/p&gt;




&lt;h4&gt;
  
  
  Word Cloud
&lt;/h4&gt;

&lt;p&gt;This is inarguably the most common visualization method used when dealing with data that is textual in nature, and it brings a lot of information to the reader. Think of it: how do you make out what kind of text you are reading when you have to be quick? You glance through it, look for the most common words in the passage, and work out its context. This gives you a fair idea of what you are reading or what it might be about. A word cloud gives you exactly this kind of idea, in a cleaner way. &lt;/p&gt;

&lt;p&gt;It makes words stand out, either by font size or by color, according to their usage frequency. Text analysis results in the form of a word cloud can clearly show the theme of a text, provided the presumption holds that more important words appear more often.&lt;/p&gt;

&lt;p&gt;Consider this example: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ma07wvdmhxkgefnof9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ma07wvdmhxkgefnof9z.png" title="source: duke.edu" alt="Word Cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most important words here are “literature”, “project”, “media”, “texts” and “data”. One can quickly decide whether to clean the data further, or settle on a subsequent strategy, just by looking at this chart. One can even draw a conclusion about the quality of the data. That is how handy it can turn out to be!&lt;/p&gt;
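&lt;p&gt;Under the hood, a word cloud is driven by nothing more than a frequency table; the made-up snippet below computes the weights that would control the font sizes:&lt;/p&gt;

```python
from collections import Counter

# Made-up text echoing the example above.
text = ("the project analyzes literature texts as data and more data "
        "media texts literature data project media data literature project")

# Frequency of each word = the weight (font size) it would get in the cloud.
freqs = Counter(text.split())

for word, count in freqs.most_common(3):
    print(word, count)
```

&lt;p&gt;Real word-cloud libraries typically add stop-word removal and layout on top, but the weighting idea is exactly this.&lt;/p&gt;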

&lt;p&gt;There are many similar kinds of visualization techniques that could be used, such as word maps and network charts, but the crux remains the same. &lt;/p&gt;




&lt;p&gt;I hope this was helpful and that I was able to put things down in a simple way. Please feel free to reach out to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey" rel="noopener noreferrer"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions.&lt;/p&gt;

&lt;p&gt;Thanks for being with me, until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Text Analytics: Wrangling Text</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Mon, 15 Feb 2021 16:31:24 +0000</pubDate>
      <link>https://dev.to/aashish/text-analytics-wrangling-text-4094</link>
      <guid>https://dev.to/aashish/text-analytics-wrangling-text-4094</guid>
      <description>&lt;p&gt;Hey people,&lt;/p&gt;

&lt;p&gt;Welcome back to yet another exciting series of narratives in our quest to understand the fundamentals of Text Analytics. In the &lt;a href="https://dev.to/aashish/text-analytics-use-cases-and-a-typical-pipeline-341h"&gt;last post&lt;/a&gt; we saw two things in particular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A few use cases&lt;/li&gt;
&lt;li&gt;And a typical Text Analytics Pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to the examples we saw, we should now have a clearer understanding of what we are actually talking about. Not everything should seem like magic now, because we are going to look into some specific aspects of the pipeline. &lt;/p&gt;

&lt;p&gt;In this post, we'll start demystifying key concepts in the pipeline. We are going to see what Text Wrangling is. But before that, I pose a question to you: why is order required in chaos? &lt;/p&gt;




&lt;p&gt;Well, every chaos has an underlying order associated with it. To simplify things in disorder, we start by finding some kind of order in them. Take, for example, what happens when your mom starts cleaning your messy room, which I assume must be in the most chaotic state, at least for some of you. What does she do first? She tries to find some order in it: first segregating the items into groups, and only then, once that is done, placing those groups of things in the right place. Think of it: how much time would it have taken if she were to put away each item one by one as she reached for it? Surely that would have taken forever, depending on how messy your room is. &lt;/p&gt;

&lt;p&gt;The bottom line is: when you find order in chaos, you reduce the amount of time required for the subsequent steps in the series to get the task done. You can apply this to any scenario. When you have a huge task at hand, it is always prudent to invest some time, and I suggest a significant amount of time, in finding some patterns, some order, some transformation, so that the subsequent steps become easier.&lt;/p&gt;

&lt;p&gt;I'll be more than happy to hear from all of you in the comments section some of the better examples of #findingOrderinChaos. &lt;/p&gt;




&lt;h3&gt;
  
  
  Text Wrangling
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Generally speaking, it is the process of cleaning the data, finding inherent structure in it (or even deriving some structure), and enriching the raw data into the desired format, i.e., transforming it for better decision making in relatively less time. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The necessity for data wrangling is often a by-product of poorly collected or poorly presented data. If one had a prescient vision of the use case, it would hardly be required. Bringing the data into the context of the use case is thus a very important step in the entire process. In fact, this is what determines how well your model works in a significant number of cases. &lt;/p&gt;

&lt;p&gt;That is the comprehensive, intuitive, self-explanatory definition that usually goes around. &lt;/p&gt;

&lt;p&gt;Here is what Wikipedia has to say about Data Wrangling. I believe this is all it takes to understand the crux of the concept. Worth a read, I suggest (at least the "Core Ideas" section). &lt;/p&gt;


&lt;div class="ltag__wikipedia--container"&gt;
  &lt;div class="ltag__wikipedia--header"&gt;
    &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fwikipedia-logo-0a3e76624c7b1c3ccdeb9493ea4add6ef5bd82d7e88d102d5ddfd7c981efa2e7.svg" class="ltag__wikipedia--logo" alt="Wikipedia Logo"&gt;
    &lt;a href="https://en.wikipedia.org/wiki/Data_wrangling" rel="noopener noreferrer"&gt;Data wrangling&lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="ltag__wikipedia--extract"&gt;&lt;p&gt;&lt;b&gt;Data wrangling&lt;/b&gt;, sometimes referred to as &lt;b&gt;data munging&lt;/b&gt;, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.&lt;/p&gt;&lt;/div&gt;
  &lt;div class="ltag__wikipedia--btn--container"&gt;
      &lt;a class="ltag__wikipedia--btn" href="https://en.wikipedia.org/wiki/Data_wrangling" rel="noopener noreferrer"&gt;View on Wikipedia&lt;/a&gt;&amp;gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Let's outline some of the common steps involved in the process of text wrangling. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text Cleaning&lt;/li&gt;
&lt;li&gt;Specific Preprocessing&lt;/li&gt;
&lt;li&gt;Tokenization&lt;/li&gt;
&lt;li&gt;Stemming or Lemmatization&lt;/li&gt;
&lt;li&gt;Stop Word removal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at each one of them in sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text Cleaning
&lt;/h3&gt;

&lt;p&gt;Once you have the raw data with you, the first step, intuitively, is to make sense of it. Text cleaning is a broad term for the many common cleaning operations performed on text. For instance, consider an HTML file as your input. A typical file consists of a lot of markup tags, styling, some metadata, and also the text you want to parse. Getting rid of the redundant data means getting rid of everything in the file except the string of text we are concerned with. A lot of languages have parsers for this task, and modern-day toolkits make it relatively simple for us.&lt;/p&gt;

&lt;p&gt;In summary, any process performed with the aim of making the text cleaner and removing the noise surrounding it can be termed text cleansing. &lt;/p&gt;
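&lt;p&gt;As a small sketch of the HTML case, Python's standard library parser can keep just the text nodes and discard the markup (the sample string is made up):&lt;/p&gt;

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes only, discarding tags, attributes and comments."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

raw = "<p>Hello, <b>world</b>! <!-- markup noise --></p>"
parser = TextExtractor()
parser.feed(raw)
print("".join(parser.chunks).strip())
```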

&lt;h3&gt;
  
  
  Specific Preprocessing
&lt;/h3&gt;

&lt;p&gt;This is again a very broad term for all the kinds of operations you would like to do before getting the ball rolling. It could be something like sentence splitting, where long text is broken down into sentences based on the application, or something like handling punctuation or cutting down on some redundant strings. Spell correction and the removal of specific nouns could also be done at this stage. &lt;/p&gt;

&lt;h3&gt;
  
  
  Tokenization
&lt;/h3&gt;

&lt;p&gt;A token is the smallest unit of text that a program or machine can process. Tokenization is essential for getting a vector representation of the text, and for the computer program to process it any further. Tokenization, therefore, is the process of breaking a sentence down into individual words (tokens). This can be done using various techniques, depending upon the problem statement and the algorithm to be used thereafter. But this sometimes isn't as simple as it seems, and many toolkits offer configurable options to counter the tricky cases. &lt;/p&gt;
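&lt;p&gt;A tiny illustration of why it isn't always simple: a naive whitespace split versus a slightly smarter regular expression (both are sketches, not production tokenizers):&lt;/p&gt;

```python
import re

sentence = "Mr. O'Brien didn't buy the e-book!"

# Naive: split on whitespace - punctuation stays glued to the words.
print(sentence.split())

# Slightly smarter: words (allowing internal apostrophes/hyphens)
# or single punctuation symbols.
tokens = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)
print(tokens)
```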

&lt;h3&gt;
  
  
  Stemming or Lemmatization
&lt;/h3&gt;

&lt;p&gt;Stemming is exactly as it sounds! It is the process of reducing a word, in any form, to its root equivalent. The rule is pretty simple: remove some of the common prefixes and suffixes. While it may be helpful on some occasions, it may not be on all. For example, consider the word &lt;em&gt;studying&lt;/em&gt;: its root word is &lt;em&gt;study&lt;/em&gt;. All its variations, like studies, studied, etc., come under the root-word umbrella of &lt;em&gt;study&lt;/em&gt;. This is what stemming does in essence.&lt;/p&gt;

&lt;p&gt;Lemmatization does something similar but is more methodical, in that it maps every inflected form of a word back to its root. It makes use of the context and the part of speech to determine the inflected form of a word, and it does what stemming does with one extra step: checking whether the resulting lemma is part of the dictionary or not. &lt;/p&gt;

&lt;p&gt;Interestingly, a stem might not be an actual dictionary word, but a lemma has to be. Stemming is thus much faster than lemmatization, but in many applications lemmatization might be the one you need. &lt;/p&gt;
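&lt;p&gt;The contrast can be sketched in a few lines of Python: a toy suffix-stripping stemmer (a real one would be, say, the Porter stemmer) next to a lemmatizer whose dictionary is a made-up stand-in lookup table (a real one would consult WordNet, e.g. through NLTK):&lt;/p&gt;

```python
SUFFIXES = ("ies", "ing", "ed", "es", "s")  # longest first, so "ies" beats "s"

def naive_stem(word):
    """Toy suffix-stripping stemmer. Note the output need not be a
    dictionary word: "studies" becomes "stud"."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer additionally validates against a dictionary of known lemmas;
# this tiny table stands in for a real lexical database.
LEMMAS = {"studies": "study", "studied": "study", "studying": "study"}

def naive_lemmatize(word, lemmas=LEMMAS):
    """Return the dictionary lemma if known, else the word unchanged."""
    return lemmas.get(word, word)
```

&lt;p&gt;Running both on &lt;code&gt;studies&lt;/code&gt; shows the difference: the stemmer yields the non-word &lt;code&gt;stud&lt;/code&gt;, while the lemmatizer yields the real word &lt;code&gt;study&lt;/code&gt; - at the cost of the extra dictionary lookup.&lt;/p&gt;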

&lt;h3&gt;
  
  
  Stop Word removal
&lt;/h3&gt;

&lt;p&gt;Consider any sentence, not necessarily one in English. You will find many words that are not required to understand the meaning of the sentence. You can think of them as supporting words, used to hold the grammar of the language together. Articles and pronouns are typically classified as stop words. Some of the most common are &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;of&lt;/code&gt;, &lt;code&gt;is&lt;/code&gt;, and so on. You can see a full list &lt;a href="https://www.geeksforgeeks.org/removing-stop-words-nltk-python/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. And mind you, this is not a comprehensive list; based on the requirements of the application, you can add more to it. One could argue that some of these words are still required for the sentence to make sense. No worries - just exclude those words from the list of stop words and that's that. You are then left with a vector of words you can play with, tailored to the problem your application is trying to solve. &lt;/p&gt;
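&lt;p&gt;In code, stop word removal is just a set lookup. The list below is a tiny illustrative sample (NLTK ships a much fuller one), and, exactly as described above, you can extend it or exclude words from it to suit the application:&lt;/p&gt;

```python
# A tiny illustrative stop word list; real NLP libraries ship fuller ones.
STOP_WORDS = {"the", "of", "is", "a", "an", "and", "in", "to", "it"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Filter out stop words, keeping the remaining tokens in order."""
    return [t for t in tokens if t not in stop_words]
```

&lt;p&gt;Passing a custom &lt;code&gt;stop_words&lt;/code&gt; set is how you would keep a word the default list would otherwise drop.&lt;/p&gt;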




&lt;p&gt;I hope this was helpful and put things down in a simple way. Please feel free to reach out to me on Twitter &lt;a href="https://twitter.com/AashishLChaubey" rel="noopener noreferrer"&gt;@AashishLChaubey&lt;/a&gt; in case you need more clarity or have any suggestions. &lt;/p&gt;

&lt;p&gt;In the next article, we will see what text visualization is. Until next time...&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>textanalytics</category>
      <category>ai</category>
    </item>
    <item>
      <title>Text Analytics - The Pipeline (Part 1)</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Thu, 08 Oct 2020 09:39:34 +0000</pubDate>
      <link>https://dev.to/aashish/text-analytics-the-pipeline-part-1-2j09</link>
      <guid>https://dev.to/aashish/text-analytics-the-pipeline-part-1-2j09</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--In8yiX6D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3buapn9mhrc31bwh5zkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--In8yiX6D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3buapn9mhrc31bwh5zkh.png" alt="Alt Text" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hello people,&lt;/p&gt;

&lt;p&gt;Another post, another plunge into the realm of AI and linguistics. If you’ve followed along till here, you’ve built good fundamentals in Text Analytics. In the last article we saw some of the business use cases of this technology and, most importantly, a brief overview - a thousand-foot view - of a typical Text Analytics pipeline. &lt;/p&gt;

&lt;p&gt;That’s a good start. But what’s critical is this article and the subsequent ones, where we bridge the gap between theory and practical implementation, with all its challenges and constraints. Beginning with this article, we shall look at the important stages of a typical pipeline, one at a time. Let's start!&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  Data Acquisition
&lt;/h3&gt;
&lt;/blockquote&gt;

&lt;p&gt;In contemporary times, getting a dataset is not considered a major task in itself, is it? &lt;/p&gt;

&lt;p&gt;It is true that in the good old days there was a major paucity of data sources across most domains, and fetching data for a specific use case was considered a herculean task. One had to scrape tons of discrete sites for very little data. But today, with the abundance of data - consolidated in one place or available through an API (Tweepy for Twitter, ImgurPython for Imgur, praw for Reddit, Kaggle, Facebook, etc.) - data acquisition can safely be struck off the impediment list before the project even starts. Most research papers also provide a link to the source of their data, be it an archive, a governmental portal, a conference portal or database, etc. These are the major sources of data, among others. &lt;/p&gt;

&lt;p&gt;In this stage, the data is consolidated using different scraping techniques and none of it is discarded, even on grounds of quality. This becomes very useful for preliminary analysis of the textual data and for fetching some early insights. One can then start to clean the data in the following stages. &lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  Data Preparation &amp;amp; Data Wrangling/Text Wrangling
&lt;/h3&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;"As a data scientist, one spends over 70% of one's time cleaning and unifying messy data so that operations can be performed on it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm sure some of us are passionate about cooking, at least a few of us? Well, what is the basic doctrine, as far as the experts are concerned, for cooking a special dish? The more attention you pay to preparing and cooking the ground spices (or, as they call it, masala), the more palatable the dish gets. The same holds for most kinds of AI applications: the more time you spend maturing your data sets, the healthier the outcome they will yield. The relation is as simple as that.&lt;/p&gt;

&lt;p&gt;Although data preparation is a multi-faceted task, text wrangling is basically the pre-processing work done to make raw text data fit for training and efficient use in the successive data analytics stages. In this stage, we convert and transform textual data at different levels of granularity, based on the requirements of the application. Text wrangling applies large-scale changes to the text by automating some low-level transformations, and the basic approach is to work with lines of text.&lt;/p&gt;

&lt;p&gt;Strictly speaking, data preprocessing covers any kind of preprocessing of the textual data before you build an analytical model, whereas data wrangling is used during Exploratory Data Analysis (EDA) and model building to adjust the data sets iteratively while analyzing the data and building the model. I'd put them in one bracket because they are semantically related, the output we expect at the end of this stage is very much shared, and the choice of one affects the result of the other. In most cases they are very tightly bound: playing with the data, getting insights, and bringing it into a format considered suitable for feeding into the model. &lt;/p&gt;

&lt;p&gt;Some of the essential common steps in text wrangling are:&lt;/p&gt;

&lt;h3&gt;
  
  
  - Text Cleansing
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;This can be any simple preprocessing of the text. It can include dropping textual data that does not meet the requirements, shortening the length, removing emojis (depending on the application), and so on. But the gist is this: we get rid of all the basic unwanted things in the raw data.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  - Sentence Splitting
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Splitting the text into individual sentences, a task as simple as that. Now the splitting could be based on various criteria that can be provided in the APIs of various libraries available for this task. There are many state-of-the-art libraries out there to achieve it.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  - Tokenization
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;A token is the smallest text unit a machine can process. As a matter of fact, to run natural language programs on the data it needs to be tokenized. Thus, in most cases, it makes sense that the smallest unit be either a word or a letter. We can further tokenize the word into letters, depending on the application.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  - Stemming or Lemmatization
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;We know many English words in many different forms, and they often carry the same semantic sense. Stemming is exactly what it sounds like - stemming, or limiting, an inflected word to its root form. Let us understand this with an example: consider the word &lt;code&gt;writing&lt;/code&gt;. It is built from its root word &lt;code&gt;write&lt;/code&gt;, which can be expressed in many different forms based on time and context, like wrote, writes, etc. But we, with our general understanding of the language, know that all these forms convey the same thing. It is therefore a good idea to reduce the word to its basic form.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Lemmatization is similar to stemming but a little more stringent. The difference is that the word obtained from stemming may not be an actual word, like the ones in the dictionary; stemming can generate arbitrary strings. In the case of lemmatization, the word thus produced necessarily has to be a real word, one from the dictionary. This makes lemmatization a little slower than stemming, as there is the added responsibility of validating the word against the dictionary.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  - Stop Word Removal
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Let's first understand what stop words are. They are nothing but a set of commonly used words in a language. One major part of preprocessing is filtering out useless data; in Natural Language Processing this useless data is known as "stop words", also called filler words - words which carry less importance than the others in a sentence. There is a set of identified words in the English language, and in other languages too, that is included in the popular NLP libraries. Some of them are &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;have&lt;/code&gt;, &lt;code&gt;he&lt;/code&gt;, &lt;code&gt;i&lt;/code&gt;, and many more. It makes intuitive sense to drop these words and focus on the other, more important words in the sentence in order to do something useful.&lt;/em&gt;&lt;/p&gt;
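&lt;p&gt;The steps above can be chained into a minimal wrangling sketch. This uses only the Python standard library, a toy stop word list, and a deliberately crude stemmer, so treat it as an illustration of the flow rather than a production pipeline:&lt;/p&gt;

```python
import re

# Tiny illustrative stop word list; real libraries ship much fuller ones.
STOP_WORDS = {"a", "the", "is", "are", "to", "of", "and", "i"}

def wrangle(text):
    """Sentence-split, tokenize, drop stop words, and crudely stem."""
    sentences = [s for s in re.findall(r"[^.!?]+", text) if s.strip()]
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # toy stemming: strip a trailing "ing" or "s" from longer words
        tokens = [t[:-3] if t.endswith("ing") and len(t) > 5
                  else t[:-1] if t.endswith("s") and len(t) > 3 else t
                  for t in tokens]
        result.append(tokens)
    return result
```

&lt;p&gt;Each stage here is a single line you would normally swap for a library call (NLTK, spaCy, etc.), which is exactly why the pipeline view of text wrangling is so useful.&lt;/p&gt;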




&lt;p&gt;Hope you all liked it. In the next article, we are going to see the other stages in the pipeline. It will be best if you also follow the next part of the series to get a thorough understanding of the topic. Please let me know if you want me to work out a few more details of the topic; I will make sure to include those. Until next time.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>textanalytics</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Text Analytics - A gentle Introduction</title>
      <dc:creator>Aashish Chaubey 💥⚡️</dc:creator>
      <pubDate>Wed, 07 Oct 2020 12:22:56 +0000</pubDate>
      <link>https://dev.to/aashish/text-analytics-a-gentle-introduction-lh5</link>
      <guid>https://dev.to/aashish/text-analytics-a-gentle-introduction-lh5</guid>
      <description>&lt;p&gt;Hi people,&lt;/p&gt;

&lt;p&gt;Welcome back to another narrative in our quest to understand the fundamentals of Text Analytics. Now that we have laid the foundation stone with a small example in the last post, let's cut to the chase. Let's deal with the intricacies of it.&lt;/p&gt;

&lt;p&gt;Many machine learning enthusiasts and fanatics will already have looked up this term, and undeniably there is a plethora of information on the web about it. You might have come across terms like &lt;strong&gt;data mining, text analysis, and text analytics&lt;/strong&gt;. And while to the untrained mind these might sound like synonyms, from the point of view of practice and experience, there is a subtle difference worth mentioning.&lt;/p&gt;




&lt;h3&gt;
  
  
  What is Text Analysis/data mining?
&lt;/h3&gt;

&lt;p&gt;Text Analysis is the term describing the very process of computational analysis of texts. It is the automated process of understanding and sorting unstructured text, making it easier to manage and mine for valuable insights. The term is very often used interchangeably with data mining, and it is just fine to do so.&lt;/p&gt;




&lt;h3&gt;
  
  
  What is Text Analytics?
&lt;/h3&gt;

&lt;p&gt;Text Analytics, on the other hand, involves a set of techniques and approaches towards bringing textual content to a point where it is represented as data and then mined for insights/trends/patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;TL;DR&lt;/em&gt;&lt;/strong&gt;: Text Analysis helps translate a text into the language of data. And it is when Text Analysis “prepares” the content that Text Analytics kicks in to help make sense of that data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why use it?
&lt;/h3&gt;

&lt;p&gt;Colossal amounts of unstructured data are generated every minute -- internet users post 456,000 new tweets, 510,000 new comments on Facebook, and send 156 million emails -- so managing and analyzing information to find what’s relevant becomes a major challenge.&lt;/p&gt;

&lt;p&gt;Thanks to text analytics, businesses can automatically extract meaning from all sorts of unstructured data, from social media posts and emails to live chats and surveys, and turn it into quantitative insights. By identifying trends and patterns with text analytics, businesses can improve customer satisfaction (by learning what their customers like and dislike about their products), detect product issues, conduct market research, and monitor brand reputation, among other things.&lt;/p&gt;

&lt;p&gt;Text analytics has many advantages – it’s scalable, meaning you can analyze large volumes of data in a very short time and allows you to obtain results in real-time. So, apart from gaining insights that help you make confident decisions, you can also resolve issues promptly.&lt;/p&gt;




&lt;h3&gt;
  
  
  How do NLP and Text Analytics relate?
&lt;/h3&gt;

&lt;p&gt;Text Analytics is an artificial intelligence (AI) technology that uses Natural Language Processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms. So in other words, NLP is just one of the multitude of ways used for carrying out text analytics.&lt;/p&gt;

&lt;p&gt;NLP is growing in importance and adoption in the community of linguists because&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It is very efficient in handling large volumes of text data.&lt;/li&gt;
&lt;li&gt;Equally good in structuring highly unstructured data sources.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And lately, it has started delivering on its huge promise of a seamless system.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Glossary&lt;/em&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unstructured data&lt;/strong&gt;: data stored in its native format and not processed until it is used, e.g., documents, e-mails, blogs, digital images, videos, and satellite imagery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computational analysis&lt;/strong&gt;: Mathematical models used to numerically study the behavior of complex systems employing a computer simulation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;In the next article, we are going to see some of the popular business use cases of Text Analytics and what exactly a typical Text Analytics pipeline (the several stages of an application) looks like.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>textanalytics</category>
      <category>introduction</category>
    </item>
  </channel>
</rss>
