DEV Community

Text Analytics: Topic Modeling

Hey people,

Welcome back to another installment in our series on the fundamentals of Text Analytics. In the last article, we looked at different ways of visualizing data and, with the help of examples, at which graphic to use under which conditions so that the reader gets the information they need. That should satisfy your appetite for now, but no single blog post can cover everything. There is a world of creative visualization out there and it is ever-growing, so you need to delve into it yourself. Feel free to look around the web for other techniques, and please share what you find with your colleagues.

Anyway, in this article, we are going to see what Topic Modeling is and what the different ways of doing it are, with the help of examples. So let's get the ball rolling...


What is Topic Modeling?

In text analytics, we often have collections of documents, such as blog posts or news articles, that we'd like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unlabelled documents, similar to clustering on numeric data, which finds natural groups of items even when we're not sure what we're looking for. It is a method of unsupervised classification of the documents. In other words, it doesn't require a predefined list of tags or training data that's been previously classified by humans.

For example, let's say you are a small business and you've recently launched a product in the market and want to know what the customer is saying about it.

Consider this tweet:

I'm no paint expert but I buffed the whole hood and applied T&T Paint protect! Very nice product.

By identifying words and expressions such as "Very nice product", topic modeling can group this review with other reviews that talk about similar things. A topic model could also be used to discover what customers are talking about in customer reviews, open-ended survey responses, and social media posts, to name just a few.


How does Topic Modeling Work?

It's simple, really. Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, along with the words and expressions that appear most often. With this information, you can quickly deduce what each set of texts is talking about. Remember, this approach is 'unsupervised', meaning that no labelled training data is required.
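
To make the "counting words" idea concrete, here is a minimal sketch in Python using only the standard library. The three reviews, the tiny stop-word list, and the helper names (`tokenize`, `shared_words`) are all made up for illustration. This is not a full topic model; it only shows the raw counts and word overlaps that a topic model starts from.

```python
import re
from collections import Counter

# Hypothetical mini-corpus of product feedback, invented for illustration.
reviews = [
    "Very nice product, the paint protect works great",
    "Nice product and fast delivery",
    "Delivery was slow and the box arrived damaged",
]

# A tiny hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "was", "very", "a"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

# One bag of words (word -> count) per document.
bags = [Counter(tokenize(r)) for r in reviews]

# Corpus-wide counts show which words appear most often overall.
corpus_counts = Counter()
for bag in bags:
    corpus_counts.update(bag)

def shared_words(i, j):
    """Words that documents i and j have in common."""
    return set(bags[i]) & set(bags[j])

print(corpus_counts.most_common(3))
print("reviews 0 and 1 share:", shared_words(0, 1))
print("reviews 0 and 2 share:", shared_words(0, 2))
```

Reviews 0 and 1 overlap on the product words, while reviews 0 and 2 have nothing in common. That kind of overlap is the signal a topic model uses to group texts.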


Let's see some of the popular methods used for topic modeling. I will try to put things down in a lucid manner, keeping the mathematics to a minimum.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.

  • Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model, we could say "Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B."

  • Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for "politics" and one for "entertainment." The most common words in the political topic might be "President", "Congress", and "government", while the entertainment topic may be made up of words such as "movies", "television", and "actor". Importantly, words can be shared between topics; a word like "budget" might appear in both equally.
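
The two principles can be written down as two small lookup tables. In the sketch below, the 90/10 and 30/70 splits come from the example above, but the word probabilities are numbers I invented, and treating topic A as "politics" and topic B as "entertainment" is my own assumption. Combining the two tables gives the chance of seeing a given word in a given document.

```python
# Principle 1: every document is a mixture of topics.
# (Proportions from the example above; topic A = politics and
# topic B = entertainment is an assumption made for this sketch.)
doc_topics = {
    "Document 1": {"A": 0.9, "B": 0.1},
    "Document 2": {"A": 0.3, "B": 0.7},
}

# Principle 2: every topic is a mixture of words.
# These probabilities are invented for illustration; each topic sums to 1.
topic_words = {
    "A": {"president": 0.4, "congress": 0.3, "government": 0.2, "budget": 0.1},
    "B": {"movies": 0.4, "television": 0.3, "actor": 0.2, "budget": 0.1},
}

def word_probability(doc, word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        share * topic_words[topic].get(word, 0.0)
        for topic, share in doc_topics[doc].items()
    )

for doc in doc_topics:
    probs = {w: round(word_probability(doc, w), 2) for w in ("president", "movies", "budget")}
    print(doc, probs)
```

Because "budget" has the same weight in both topics here, it is equally likely in both documents, while "president" is far more likely in Document 1 than in Document 2.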

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that are associated with each topic, while also determining the mixture of topics that describes each document. The purpose of LDA is to map each document in our corpus to a set of topics that covers a good deal of the words in the document.
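
For the curious, here is a rough sketch of how both mixtures can be estimated together. It uses collapsed Gibbs sampling, which is one common way to fit LDA and is my choice for this illustration, not the only option. The function name `lda_gibbs`, the default values of `alpha` and `beta`, and the four toy documents are all assumptions. A real project would normally reach for an established library.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (doc_topic, topic_word), where
    doc_topic[d][k] is the estimated share of topic k in document d and
    topic_word[k][w] is the estimated probability of word w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that are updated as topic assignments change.
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][w] += 1
            topic_totals[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current assignment out of the counts.
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][w] -= 1
                topic_totals[k] -= 1

                # Topic weight = (how much this document uses the topic)
                #              * (how much the topic uses this word).
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][w] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Put the newly sampled assignment back into the counts.
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][w] += 1
                topic_totals[k] += 1

    # Turn the smoothed counts into proportions.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        {w: (topic_word_counts[k][w] + beta) / (topic_totals[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    return doc_topic, topic_word

# Hypothetical toy corpus: two "politics" and two "entertainment" documents.
docs = [
    "president congress government budget president".split(),
    "congress government president budget".split(),
    "movies television actor budget movies".split(),
    "actor television movies".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for d, shares in enumerate(doc_topic):
    print("document", d, [round(s, 2) for s in shares])
```

Each printed row is the topic mixture of one document, and `topic_word` holds the word mixture of each topic. The topic numbers themselves are arbitrary, so it is up to you to read the top words and decide what each topic means.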

Like everything we use today, LDA has its pros and cons. Although it may seem intuitive and easy to implement at this point, the results it yields may not always be accurate. After all, it is an unsupervised approach. So this is one way of topic modeling.


Latent Semantic Analysis (LSA)

LSA is another popular Topic Modeling technique in Text Analytics. The underlying theory is that the meaning of a word can be grasped by looking at the contexts it appears in. In other words, words that are close in meaning will occur in similar pieces of text.

That said, LSA computes how frequently words occur in the documents, and in the whole corpus, and assumes that similar documents will contain approximately the same distribution of word frequencies for certain words. In this case, syntactic information (e.g. word order) and semantic information (e.g. the multiplicity of meanings of a given word) are ignored and each document is treated as a bag of words.

A matrix containing word counts per document (rows represent unique words and columns represent documents) is constructed from a large piece of text, and a mathematical technique called Singular Value Decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. SVD is basically a factorization of the matrix: here, we reduce the number of rows (the words) while preserving the similarity structure among the columns (the documents). Documents are then compared by taking the cosine of the angle between the vectors formed by any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
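
Here is a small sketch of those steps in plain Python. To keep it dependency-free, the SVD step is approximated with power iteration on the matrix of document-to-document dot products, which is a simple stand-in I picked for illustration; in practice you would call a library SVD routine. The helper names (`term_document_matrix`, `reduce_rows`, `cosine`, `column`) and the four toy documents are assumptions.

```python
import math

def term_document_matrix(docs):
    """Rows are unique words, columns are documents, cells are word counts."""
    vocab = sorted({w for doc in docs for w in doc})
    return vocab, [[doc.count(w) for doc in docs] for w in vocab]

def reduce_rows(matrix, k, steps=500):
    """Return a k-row matrix (Sigma_k times V_k transposed) approximating `matrix`.

    The top k singular directions are found by power iteration with
    deflation on the Gram matrix A^T A, a simple stand-in for a full SVD.
    """
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]
    reduced = []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(n)]  # fixed, non-uniform start vector
        for _ in range(steps):
            v = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm < 1e-12:  # nothing left to capture
                break
            v = [x / norm for x in v]
        eigenvalue = sum(v[i] * gram[i][j] * v[j] for i in range(n) for j in range(n))
        sigma = math.sqrt(max(eigenvalue, 0.0))
        reduced.append([sigma * x for x in v])
        # Deflation: remove the direction just found so the next pass finds a new one.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return reduced

def column(matrix, j):
    """The vector that represents document j."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Hypothetical toy corpus: two "politics" and two "entertainment" documents.
docs = [
    "president congress government budget".split(),
    "congress government president president".split(),
    "movies television actor budget".split(),
    "actor television movies movies".split(),
]
vocab, counts = term_document_matrix(docs)
reduced = reduce_rows(counts, k=2)

print(len(counts), "rows reduced to", len(reduced))
print("doc 0 vs doc 1:", round(cosine(column(reduced, 0), column(reduced, 1)), 2))
print("doc 0 vs doc 2:", round(cosine(column(reduced, 0), column(reduced, 2)), 2))
```

With `k=2`, the seven word rows shrink to two rows, and the two politics documents end up with a cosine close to 1 while a politics document and an entertainment document score much lower.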

Actually, LDA and LSA are somewhat similar in their underlying hypothesis. LSA, however, is also used for text summarization, text classification, and dimensionality reduction.

For more information, the Wikipedia article on LSA is a very comprehensive reference:

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine of the angle between the two vectors formed by any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.


This article should have given you a fair idea of what Topic Modeling is, when we use it, and what the popular ways of doing it are. We looked at a couple of algorithms at a high level, just to understand how they work, and kept the discussion as language-agnostic and framework-agnostic as possible.

In the next article, we shall see what Text Classification is and how it compares with Topic Modeling. That will be a very important segment in your journey to master NLP.

I hope this was helpful and that I was able to put things down in a simple way. Please feel free to reach out to me on Twitter @AashishLChaubey in case you need more clarity or have any suggestions.

Until next time...
