datatoinfinity

Find the Most Frequent Word in Text using Python | NLP Basics Explained

Ever wondered which word appears the most in a text? Whether you’re analyzing customer feedback, blog posts, or any text data, finding the most frequent word is a common Natural Language Processing (NLP) task. In this post, we’ll explore how to do it in Python, why it matters, and some real-world applications.

Why Find the Most Frequent Word?

Word frequency analysis helps in:

  • Keyword extraction for SEO and blog content.
  • Sentiment analysis of customer feedback.
  • Topic modeling in large text datasets.
  • Preparing training data for chatbots and AI models.

Steps to Find the Most Frequent Word

  1. Prepare the text
  2. Tokenize and remove stop words
  3. Count word frequencies
  4. Build a DataFrame of words and frequencies
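Before walking through the NLTK version, the steps above can be sketched in a few lines with Python's built-in collections.Counter (stop-word removal omitted for brevity):

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept"
words = text.lower().split()   # naive whitespace tokenization
counts = Counter(words)        # count word frequency
print(counts.most_common(1))   # [('the', 3)]
```

The full version below adds proper tokenization, stop-word filtering, and a DataFrame for presentation.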

Code

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import nltk

# Download NLTK resources (run once)
# nltk.download('punkt')
# nltk.download('stopwords')

corpora = [
    'Artificial Intelligence is transforming the world. AI is used in healthcare, finance, and education. Machine Learning, a branch of AI, powers recommendation systems and predictive analytics.',
    'Success is not the key to happiness. Happiness is the key to success. If you love what you do, you will be successful.',
    'The product is great. The quality is great and the price is reasonable. I will recommend this product to my friends because the product is worth the price.',
    'The football match was intense. The players gave their best. The match ended with a thrilling victory. Fans celebrated the match with great excitement.'
]

stop_words = set(stopwords.words('english'))

all_data = []  

for i, corpus in enumerate(corpora, start=1):
    words = [word.lower() for word in word_tokenize(corpus)
             if word.lower() not in stop_words and word.isalpha()]

    # Count word frequency
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1

    # Convert to list of dicts for the DataFrame
    for word, freq in word_freq.items():
        all_data.append({"Corpus": f"Corpus_{i}", "Word": word, "Frequency": freq})

df = pd.DataFrame(all_data)

# Sort by frequency for better readability
df = df.sort_values(by=["Corpus", "Frequency"], ascending=[True, False])

print(df)

Importing Libraries and Modules

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import nltk

Explanation:

  1. import nltk NLTK, the Natural Language Toolkit, is a prominent open-source Python library for Natural Language Processing (NLP).
  2. from nltk.corpus import stopwords imports the stop-words corpus.
  3. from nltk.tokenize import word_tokenize imports the tokenizer that splits a sentence into individual words.
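As a rough illustration of what tokenization does (word_tokenize itself needs the NLTK punkt data downloaded first), a stdlib approximation with re.findall behaves similarly on plain English sentences:

```python
import re

sentence = "Success is not the key to happiness."
# Crude stand-in for word_tokenize: pull out runs of letters
tokens = re.findall(r"[A-Za-z]+", sentence.lower())
print(tokens)  # ['success', 'is', 'not', 'the', 'key', 'to', 'happiness']
```

Unlike this regex, word_tokenize also handles contractions and punctuation as separate tokens, which is why the article filters with isalpha() later.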

Prepare the Text

corpora = [
    'Artificial Intelligence is transforming the world. AI is used in healthcare, finance, and education. Machine Learning, a branch of AI, powers recommendation systems and predictive analytics.',
    'Success is not the key to happiness. Happiness is the key to success. If you love what you do, you will be successful.',
    'The product is great. The quality is great and the price is reasonable. I will recommend this product to my friends because the product is worth the price.',
    'The football match was intense. The players gave their best. The match ended with a thrilling victory. Fans celebrated the match with great excitement.'
]

Process each Corpus

stop_words = set(stopwords.words('english'))
all_data = [] 
for i, corpus in enumerate(corpora, start=1):
    words = [word.lower() for word in word_tokenize(corpus) 
             if word.lower() not in stop_words and word.isalpha()]

Explanation:

  1. stop_words = set(stopwords.words('english')) loads the full list of English stop words into a set for fast lookup.
  2. all_data = [] creates an empty list to store all word-frequency records.
  3. for i, corpus in enumerate(corpora, start=1)
    • Loops over the corpora list.
    • enumerate gives both:
      • i — the index of the corpus (starting at 1)
      • corpus — the actual text string
  4. words = [word.lower() for word in word_tokenize(corpus) if word.lower() not in stop_words and word.isalpha()]
    • word.lower() converts each token to lowercase.
    • for word in word_tokenize(corpus) iterates over the tokens produced by the word_tokenize() function.
    • if word.lower() not in stop_words and word.isalpha() keeps a token only if its lowercase form is not a stop word and it consists entirely of letters (dropping punctuation and numbers).
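To see the filter in action without loading NLTK, here is the same comprehension run over pre-tokenized input with a small hand-picked subset of stop words (the real code uses NLTK's full English list):

```python
# Illustrative subset only; the article loads stopwords.words('english')
stop_words = {"is", "not", "the", "to"}

tokens = ["Success", "is", "not", "the", "key", "to", "happiness", "."]
words = [w.lower() for w in tokens
         if w.lower() not in stop_words and w.isalpha()]
print(words)  # ['success', 'key', 'happiness']
```

Note that the period is dropped by isalpha() and the stop words by the set lookup, leaving only the content-bearing words.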

Count Frequency

word_freq = {}
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1

Explanation:

  1. word_freq = {} creates a dictionary to store each word's frequency.
  2. for word in words iterates through the words list.
  3. word_freq[word] = word_freq.get(word, 0) + 1 — word_freq.get(word, 0) returns the word's current count if it exists in the dictionary, or 0 otherwise; + 1 then increments the count.
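The same counting logic is available ready-made in the standard library; collections.Counter produces an equivalent mapping in one call:

```python
from collections import Counter

words = ["ai", "machine", "ai", "learning"]
word_freq = dict(Counter(words))  # same result as the manual loop
print(word_freq)  # {'ai': 2, 'machine': 1, 'learning': 1}
```

The manual loop is worth understanding, but Counter is the idiomatic choice in production code.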

Convert to list of dicts for DataFrame

for word, freq in word_freq.items():
    all_data.append({"Corpus": f"Corpus_{i}", "Word": word, "Frequency": freq})

Create Pandas DataFrame

df = pd.DataFrame(all_data)
df = df.sort_values(by=["Corpus", "Frequency"], ascending=[True, False])
print(df)
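Once the DataFrame exists, pulling out just the single most frequent word per corpus takes one groupby. A small sketch using toy rows shaped like the article's df:

```python
import pandas as pd

# Toy rows with the same three columns the article builds
df = pd.DataFrame([
    {"Corpus": "Corpus_1", "Word": "ai", "Frequency": 2},
    {"Corpus": "Corpus_1", "Word": "world", "Frequency": 1},
    {"Corpus": "Corpus_2", "Word": "key", "Frequency": 2},
])

# Select the row with the highest Frequency within each Corpus
top = df.loc[df.groupby("Corpus")["Frequency"].idxmax()]
print(top)
```

idxmax() returns the index label of the maximum Frequency in each group, and df.loc then fetches those rows.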

Whole Code

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import nltk

# Download NLTK resources (run once)
# nltk.download('punkt')
# nltk.download('stopwords')

# Define corpora
corpora = [
    'Artificial Intelligence is transforming the world. AI is used in healthcare, finance, and education. Machine Learning, a branch of AI, powers recommendation systems and predictive analytics.',
    'Success is not the key to happiness. Happiness is the key to success. If you love what you do, you will be successful.',
    'The product is great. The quality is great and the price is reasonable. I will recommend this product to my friends because the product is worth the price.',
    'The football match was intense. The players gave their best. The match ended with a thrilling victory. Fans celebrated the match with great excitement.'
]

stop_words = set(stopwords.words('english'))

all_data = []  # to store all word-frequency data

# Process each corpus
for i, corpus in enumerate(corpora, start=1):
    words = [word.lower() for word in word_tokenize(corpus) 
             if word.lower() not in stop_words and word.isalpha()]
    
    # Count frequency
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
    
    # Convert to list of dicts for DataFrame
    for word, freq in word_freq.items():
        all_data.append({"Corpus": f"Corpus_{i}", "Word": word, "Frequency": freq})

# Create Pandas DataFrame
df = pd.DataFrame(all_data)

# Sort by frequency for better readability
df = df.sort_values(by=["Corpus", "Frequency"], ascending=[True, False])

print(df)

Output:

      Corpus          Word  Frequency
4   Corpus_1            ai          2
0   Corpus_1    artificial          1
1   Corpus_1  intelligence          1
2   Corpus_1  transforming          1
3   Corpus_1         world          1
5   Corpus_1          used          1
6   Corpus_1    healthcare          1
7   Corpus_1       finance          1
8   Corpus_1     education          1
9   Corpus_1       machine          1
