What is a word cloud?
A word cloud, also known as a tag cloud or text cloud, is a visual representation of text data where the size of each word corresponds to its frequency or importance within the text. Larger, bolder words indicate higher frequency, making it easy to quickly grasp the most prominent themes or keywords in a piece of text.
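To make the "size corresponds to frequency" idea concrete, here is a minimal sketch that maps word counts to font sizes. The function name `font_sizes`, the linear scaling, and the 10 to 48 point range are assumptions chosen for illustration; real word cloud tools may scale sizes differently.

```python
from collections import Counter

def font_sizes(words, min_size=10, max_size=48):
    """Map each distinct word to a font size that grows linearly with its count."""
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when every count is equal
    return {
        word: min_size + (count - lo) * (max_size - min_size) / span
        for word, count in counts.items()
    }

# Illustrative input: "india" appears 3 times, "country" twice, "ocean" once.
sizes = font_sizes(["india", "country", "india", "ocean", "india", "country"])
print(sizes)
```

The most frequent word gets the largest size and the rarest gets the smallest, which is the visual effect described above.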
What are we going to do?
- Import the nltk library
- Initialise the corpus
- Remove stop words
- Create a vocabulary from that corpus
Import Library
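# Needs the NLTK 'punkt' (or 'punkt_tab' on newer releases) and 'stopwords' data; fetch them once with nltk.download(...).
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords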
Initialise corpus
corpus = '''India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[25] is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.'''
# Strip the leftover footnote markers and the closing parenthesis.
corpus = corpus.replace("[25]", "")
corpus = corpus.replace("[f]", "")
corpus = corpus.replace(")", "")
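Replacing each footnote marker by hand works for a short passage, but it does not scale to longer text. As an alternative sketch (not what this post uses), a regular expression can remove every bracketed marker in one pass. The helper name `strip_citation_markers` and the short `sample` string are made up for illustration.

```python
import re

def strip_citation_markers(text):
    """Remove bracketed markers such as [25] or [f] left over from copied text."""
    # Match "[" followed by one or more letters or digits, then "]".
    return re.sub(r"\[[0-9A-Za-z]+\]", "", text)

sample = "Gaṇarājya),[25] is a country; Pakistan to the west;[f] China"
cleaned = strip_citation_markers(sample)
print(cleaned)
```

Note that this pattern would also remove any genuine bracketed word such as "[sic]", so it is worth checking the text before applying it blindly.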
Removing Stop Words
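words = []
# Build the stop-word set once rather than reloading the list for every token.
stop_words = set(stopwords.words('english'))
for word in word_tokenize(corpus):
    # Keep tokens that are not stop words and have at least two characters,
    # which also drops single-character punctuation.
    if word.lower() not in stop_words and len(word) >= 2:
        words.append(word.lower())
print(words)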
Output: ['india', 'officially', 'republic', 'india', 'hindi', 'bhārat', 'gaṇarājya', 'country', 'south', 'asia', 'seventh-largest', 'country', 'area', 'second-most', 'populous', 'country', 'populous', 'democracy', 'world', 'bounded', 'indian', 'ocean', 'south', 'arabian', 'sea', 'southwest', 'bay', 'bengal', 'southeast', 'shares', 'land', 'borders', 'pakistan', 'west', 'china', 'nepal', 'bhutan', 'north', 'bangladesh', 'myanmar', 'east', 'indian', 'ocean', 'india', 'vicinity', 'sri', 'lanka', 'maldives', 'andaman', 'nicobar', 'islands', 'share', 'maritime', 'border', 'thailand', 'myanmar', 'indonesia']
Creating Vocabulary
vocab = list(set(words))  # keep each distinct word once
print(len(vocab))
Output: 48
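The vocabulary tells us which words occur, but a word cloud also needs to know how often each one occurs. A minimal sketch of that counting step uses `collections.Counter` from the standard library. The `words_sample` list below is a handful of words picked from the output above and hard-coded so the snippet runs on its own; in practice you would pass the full `words` list.

```python
from collections import Counter

# Illustrative sample only: a few words picked from the filtered list above.
words_sample = [
    "india", "officially", "republic", "india", "country",
    "south", "asia", "country", "populous", "country",
]

freq = Counter(words_sample)   # word -> number of occurrences
top = freq.most_common(2)      # the two most frequent words with their counts
print(top)
print(len(freq))               # number of distinct words, same idea as len(vocab)
```

These counts are what a word cloud renderer would use to decide how large to draw each word.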