NLP Visualized: Part 1 – Build Word Clouds to Understand Your Text

#nlp #machinelearning #devto #python

What is a word cloud?

A word cloud, also known as a tag cloud or text cloud, is a visual representation of text data where the size of each word corresponds to its frequency or importance within the text. Larger, bolder words indicate higher frequency, making it easy to quickly grasp the most prominent themes or keywords in a piece of text.
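To make the link between frequency and size concrete, here is a minimal sketch that maps word counts to font sizes with simple linear scaling. The function name `scale_font_sizes` and the 10 to 60 point range are assumptions chosen for illustration; real word cloud libraries apply their own scaling and layout rules.

```python
def scale_font_sizes(frequencies, min_size=10, max_size=60):
    """Map each word's count to a font size between min_size and max_size."""
    lowest = min(frequencies.values())
    highest = max(frequencies.values())
    span = highest - lowest
    sizes = {}
    for word, count in frequencies.items():
        if span == 0:
            # Every word occurs equally often, so give them all the same size
            sizes[word] = max_size
        else:
            # Linear interpolation between the smallest and largest size
            sizes[word] = min_size + (count - lowest) * (max_size - min_size) / span
    return sizes


# A few counts taken from the corpus used later in this post
example_counts = {"india": 3, "country": 3, "ocean": 2, "asia": 1}
font_sizes = scale_font_sizes(example_counts)
print(font_sizes)
```

The most frequent words get the largest size, the rarest get the smallest, and everything else falls in between.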

What are we going to do?

Import the nltk library
Initialise the corpus
Remove stop words
Create a vocabulary from the corpus

Import Library

# Requires the NLTK data packages 'punkt' and 'stopwords' (install once via nltk.download)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

Stop Words Removal

Initialise corpus

corpus='''India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[25] is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.'''

# Strip the Wikipedia footnote markers and the closing parenthesis
corpus=corpus.replace("[25]","")
corpus=corpus.replace("[f]","")
corpus=corpus.replace(")","")
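Hard-coding each marker works for a short passage, but a longer text will contain many different ones. As an alternative sketch, a regular expression can remove every bracketed marker in one pass. The helper name `strip_citation_markers` is hypothetical, and the sample string is a shortened excerpt of the corpus, not the full text.

```python
import re


def strip_citation_markers(text):
    """Remove bracketed footnote markers such as [25] or [f]."""
    # \[ ... \] matches an opening bracket, any run of non-']' characters, and a closing bracket
    return re.sub(r"\[[^\]]*\]", "", text)


# Shortened excerpt of the corpus, used only for illustration
sample = "India (Hindi: Bhārat Gaṇarājya),[25] shares land borders with Pakistan to the west;[f] China"
cleaned = strip_citation_markers(sample)
print(cleaned)
```

Note that this pattern removes any text in square brackets, so it is only appropriate when brackets are used purely for footnote markers.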

Removing Stop Words

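# Build the stop word set once; calling stopwords.words() on every iteration is slow
stop_words = set(stopwords.words('english'))
words = []
for word in word_tokenize(corpus):
    token = word.lower()
    # Keep non-stop-words of at least two characters (this also drops single punctuation marks)
    if token not in stop_words and len(token) >= 2:
        words.append(token)
print(words)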

Output:
['india', 'officially', 'republic', 'india', 'hindi', 'bhārat', 'gaṇarājya', 'country', 'south', 'asia', 'seventh-largest', 'country', 'area', 'second-most', 'populous', 'country', 'populous', 'democracy', 'world', 'bounded', 'indian', 'ocean', 'south', 'arabian', 'sea', 'southwest', 'bay', 'bengal', 'southeast', 'shares', 'land', 'borders', 'pakistan', 'west', 'china', 'nepal', 'bhutan', 'north', 'bangladesh', 'myanmar', 'east', 'indian', 'ocean', 'india', 'vicinity', 'sri', 'lanka', 'maldives', 'andaman', 'nicobar', 'islands', 'share', 'maritime', 'border', 'thailand', 'myanmar', 'indonesia']

Creating Vocabulary

vocab=list(set(words))
print(len(vocab))

Output:
48
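The vocabulary tells us which words occur, but a word cloud also needs to know how often each one occurs. A minimal sketch using `collections.Counter` from the standard library is shown here. The `sample_words` list is a shortened stand-in for the `words` list built in the stop word step, so the counts are illustrative only.

```python
from collections import Counter

# Shortened stand-in for the `words` list; swap in the real list to count the full corpus
sample_words = ['india', 'republic', 'india', 'country', 'south', 'asia', 'country', 'india']

# Counter maps each distinct word to the number of times it appears
word_counts = Counter(sample_words)

print(word_counts.most_common(2))  # the two most frequent words with their counts
print(len(word_counts))            # number of distinct words, i.e. the vocabulary size
```

A word-to-count mapping like this is the usual input for drawing the cloud itself. For example, the third-party `wordcloud` package accepts one through `WordCloud.generate_from_frequencies`.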
