Ever wondered which word appears most often in a text? Whether you're analyzing customer feedback, blog posts, or any other text data, finding the most frequent word is a common Natural Language Processing (NLP) task. In this post, we'll explore how to do it in Python, why it matters, and some real-world applications.
Why Find the Most Frequent Word?
Word frequency analysis helps in:
- Keyword extraction for SEO and blogs.
- Sentiment analysis in customer feedback.
- Topic modeling in large text datasets.
- Chatbots & AI models for training data.
Steps to Find the Most Frequent Word
- Prepare the text
- Tokenize and remove stop words
- Count word frequencies
- Build a DataFrame of words and frequencies
Code
```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import nltk

corpora = [
    'Artificial Intelligence is transforming the world. AI is used in healthcare, finance, and education. Machine Learning, a branch of AI, powers recommendation systems and predictive analytics.',
    'Success is not the key to happiness. Happiness is the key to success. If you love what you do, you will be successful.',
    'The product is great. The quality is great and the price is reasonable. I will recommend this product to my friends because the product is worth the price.',
    'The football match was intense. The players gave their best. The match ended with a thrilling victory. Fans celebrated the match with great excitement.'
]

stop_words = set(stopwords.words('english'))
all_data = []

for i, corpus in enumerate(corpora, start=1):
    words = [word.lower() for word in word_tokenize(corpus)
             if word.lower() not in stop_words and word.isalpha()]
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
    for word, freq in word_freq.items():
        all_data.append({"Corpus": f"Corpus_{i}", "Word": word, "Frequency": freq})

df = pd.DataFrame(all_data)
# Sort by frequency for better readability
df = df.sort_values(by=["Corpus", "Frequency"], ascending=[True, False])
print(df)
```
Importing Libraries and Modules
```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import nltk
```
Explanation:
- `import nltk` — NLTK, the Natural Language Toolkit, is a prominent open-source Python library for Natural Language Processing (NLP).
- `from nltk.corpus import stopwords` — imports the stopwords corpus.
- `from nltk.tokenize import word_tokenize` — imports the tokenizer that splits a sentence into individual words.
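If you want to see what tokenization produces without NLTK's data downloads, here is a minimal regex-based sketch. Note that this is not NLTK's actual algorithm (`word_tokenize` handles punctuation, contractions, and more), just a rough stand-in for illustration:

```python
import re

def simple_tokenize(text):
    # Pull out runs of letters only -- a rough approximation of
    # word_tokenize for plain English sentences.
    return re.findall(r"[A-Za-z]+", text)

tokens = simple_tokenize("Success is not the key to happiness.")
print(tokens)  # ['Success', 'is', 'not', 'the', 'key', 'to', 'happiness']
```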
Prepare the Text
```python
corpora = [
    'Artificial Intelligence is transforming the world. AI is used in healthcare, finance, and education. Machine Learning, a branch of AI, powers recommendation systems and predictive analytics.',
    'Success is not the key to happiness. Happiness is the key to success. If you love what you do, you will be successful.',
    'The product is great. The quality is great and the price is reasonable. I will recommend this product to my friends because the product is worth the price.',
    'The football match was intense. The players gave their best. The match ended with a thrilling victory. Fans celebrated the match with great excitement.'
]
```
Process Each Corpus
```python
stop_words = set(stopwords.words('english'))
all_data = []

for i, corpus in enumerate(corpora, start=1):
    words = [word.lower() for word in word_tokenize(corpus)
             if word.lower() not in stop_words and word.isalpha()]
```
Explanation:
- `stop_words = set(stopwords.words('english'))` — loads the English stopword list into a set for fast lookup.
- `all_data = []` — a list to collect all word-frequency records.
- `for i, corpus in enumerate(corpora, start=1)` — loops over the corpora list. `enumerate` yields both `i` (the corpus index, starting at 1) and `corpus` (the actual text string).
- `words = [word.lower() for word in word_tokenize(corpus) if word.lower() not in stop_words and word.isalpha()]` — a list comprehension that:
  - splits the corpus into tokens with `word_tokenize()`,
  - lowercases each token with `word.lower()`,
  - keeps only tokens that are purely alphabetic (`word.isalpha()`) and whose lowercase form is not a stopword.
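To see this filtering in action without loading NLTK's stopword list, here is a sketch using a small hand-picked stopword set (an assumption for illustration — the real list from `stopwords.words('english')` contains far more entries):

```python
# A tiny stand-in for NLTK's English stopword list.
stop_words = {"is", "not", "the", "to", "you", "if", "what", "do", "will", "be"}

# Pre-tokenized input, as word_tokenize would produce it.
tokens = ["Success", "is", "not", "the", "key", "to", "happiness", "."]

# Lowercase, drop stopwords, and keep only alphabetic tokens -- the same
# three conditions used in the list comprehension above.
words = [t.lower() for t in tokens if t.lower() not in stop_words and t.isalpha()]
print(words)  # ['success', 'key', 'happiness']
```

Note that the punctuation token `"."` is dropped by `isalpha()`, which is why that check matters alongside the stopword test.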
Count Frequency
```python
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
```
Explanation:
- `word_freq = {}` — creates a dictionary to store the frequency of each word.
- `for word in words` — iterates through the filtered words list.
- `word_freq[word] = word_freq.get(word, 0) + 1` — `word_freq.get(word, 0)` returns the word's current count if it exists in the dictionary, otherwise 0 (the default value). Then `+ 1` increments the count.
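This counting pattern is also built into the standard library: `collections.Counter` replaces the manual dictionary, and its `most_common()` method directly answers the "most frequent word" question:

```python
from collections import Counter

words = ["ai", "artificial", "intelligence", "ai", "world"]

# Counter does the same job as the get()-based loop above.
word_freq = Counter(words)
print(word_freq["ai"])           # 2
print(word_freq.most_common(1))  # [('ai', 2)]
```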
Convert to list of dicts for DataFrame
```python
    for word, freq in word_freq.items():
        all_data.append({"Corpus": f"Corpus_{i}", "Word": word, "Frequency": freq})
```
Create a Pandas DataFrame
```python
df = pd.DataFrame(all_data)
df = df.sort_values(by=["Corpus", "Frequency"], ascending=[True, False])
print(df)
```
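Since the goal of the post is the single most frequent word, one more step extracts the top word per corpus. This is a sketch using `groupby` and `idxmax` on a small hand-built frame (the column names match the code above; `idxmax` picks the first maximum per group, so ties are broken by row order):

```python
import pandas as pd

# Minimal frame mimicking the structure built above.
df = pd.DataFrame([
    {"Corpus": "Corpus_1", "Word": "ai", "Frequency": 2},
    {"Corpus": "Corpus_1", "Word": "world", "Frequency": 1},
    {"Corpus": "Corpus_2", "Word": "happiness", "Frequency": 2},
    {"Corpus": "Corpus_2", "Word": "key", "Frequency": 2},
])

# For each corpus, find the row index of the highest frequency,
# then select those rows.
top = df.loc[df.groupby("Corpus")["Frequency"].idxmax()]
print(top)
```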
Whole Code
```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import nltk

# Download NLTK resources (run once)
# nltk.download('punkt')
# nltk.download('stopwords')

# Define corpora
corpora = [
    'Artificial Intelligence is transforming the world. AI is used in healthcare, finance, and education. Machine Learning, a branch of AI, powers recommendation systems and predictive analytics.',
    'Success is not the key to happiness. Happiness is the key to success. If you love what you do, you will be successful.',
    'The product is great. The quality is great and the price is reasonable. I will recommend this product to my friends because the product is worth the price.',
    'The football match was intense. The players gave their best. The match ended with a thrilling victory. Fans celebrated the match with great excitement.'
]

stop_words = set(stopwords.words('english'))
all_data = []  # to store all word-frequency data

# Process each corpus
for i, corpus in enumerate(corpora, start=1):
    words = [word.lower() for word in word_tokenize(corpus)
             if word.lower() not in stop_words and word.isalpha()]

    # Count frequency
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1

    # Convert to list of dicts for DataFrame
    for word, freq in word_freq.items():
        all_data.append({"Corpus": f"Corpus_{i}", "Word": word, "Frequency": freq})

# Create Pandas DataFrame
df = pd.DataFrame(all_data)

# Sort by frequency for better readability
df = df.sort_values(by=["Corpus", "Frequency"], ascending=[True, False])
print(df)
```
Output:
```
      Corpus          Word  Frequency
4   Corpus_1            ai          2
0   Corpus_1    artificial          1
1   Corpus_1  intelligence          1
2   Corpus_1  transforming          1
3   Corpus_1         world          1
5   Corpus_1          used          1
6   Corpus_1    healthcare          1
7   Corpus_1       finance          1
8   Corpus_1     education          1
9   Corpus_1       machine          1
```
Before running the code for the first time, download the required NLTK resources once with `nltk.download('punkt')` (for tokenization) and `nltk.download('stopwords')` (for the stopword list).