
Shun Yamada


How to extract high-frequency words in NLTK

While reading the official documentation for NLTK (Natural Language Toolkit), I tried extracting the words that appear most frequently in a sample text. In this post, I print the three most frequent words, counting nouns only.


  • Python
  • NLTK

Install NLTK

$ pip install nltk

Extract High-frequency words

Let's start coding. First, download punkt and averaged_perceptron_tagger, which are needed for word tokenization and part-of-speech tagging. Next, read a sample text and split it into words. Then remove everything that is not a noun from the result. Finally, get the most frequent words.


import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Import nltk, then download punkt and averaged_perceptron_tagger. Once they are downloaded into your environment, you don't have to download them again.
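Since the download only has to happen once, it can be skipped when the resource is already installed. Here is a minimal sketch of that idea. The helper name ensure_resource and its find/download parameters are my own assumptions for illustration, not part of NLTK; by default it falls back to nltk.data.find and nltk.download. The resource paths in the usage comment are the ones I would expect for these two packages, and newer NLTK releases may ask for differently named packages.

```python
def ensure_resource(resource_path, package, find=None, download=None):
    """Download an NLTK package only if it is not installed yet.

    Returns True when a download was triggered, False otherwise.
    `find` and `download` are hypothetical hooks so the helper can be
    exercised without NLTK; they default to the real NLTK functions.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError if the resource is missing
        return False
    except LookupError:
        download(package)
        return True


# Typical usage (assumed resource paths):
# ensure_resource('tokenizers/punkt', 'punkt')
# ensure_resource('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger')
```

Calling nltk.download() again on an installed package is harmless, so this check mostly saves a network round trip.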

Tokenize the text into words

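# Read the sample text and split it into word tokens
with open('sample.txt') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

# Lowercase every token so that "The" and "the" are counted together
tokens_l = [w.lower() for w in tokens]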

Prepare an essay or another long text and save it as sample.txt. After the file is read, the text is split into word tokens. The tokens are then converted to lowercase so that capitalized and lowercase forms of a word are recognized as the same word.
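To see why the lowercasing step matters, here is a small sketch that counts a hand-written token list with and without it. The token list is made up for illustration, and collections.Counter stands in for the frequency counting so the snippet runs without any NLTK data.

```python
from collections import Counter

# Made-up tokens, roughly what a word tokenizer might return
tokens = ["The", "cat", "saw", "the", "Cat", "."]

# Without lowercasing, "The"/"the" and "cat"/"Cat" are separate entries
case_sensitive = Counter(tokens)

# With lowercasing, they are merged into one entry each
tokens_l = [w.lower() for w in tokens]
case_insensitive = Counter(tokens_l)

print(case_sensitive["the"], case_insensitive["the"])  # 1 2
```

One trade-off to keep in mind: a part-of-speech tagger may rely on capitalization to spot proper nouns, so lowercasing before tagging can change some tags.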

Extract only nouns

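# Tag each token with its part of speech, then keep only nouns (tag NN)
pos = nltk.pos_tag(tokens_l)
only_nn = [x for (x, y) in pos if y == 'NN']

# Count how often each noun occurs
freq = nltk.FreqDist(only_nn)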

Remove the non-noun words from the result, then count how often each remaining word occurs.
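The filtering step can be tried on its own with a few hand-written (word, tag) pairs. The pairs below are made up and only stand in for tagger output; they are not real results. In the Penn Treebank tag set, NN marks a singular or mass noun, while NNS, NNP and NNPS mark plural and proper nouns, so the sketch also shows a broader variant that keeps every tag starting with NN.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for part-of-speech tagger output
pos = [
    ("dog", "NN"), ("runs", "VBZ"), ("dogs", "NNS"), ("dog", "NN"),
    ("park", "NN"), ("london", "NNP"), ("quickly", "RB"),
]

# Strict filter: singular/mass nouns only
only_nn = [word for word, tag in pos if tag == "NN"]

# Broader filter (an alternative): also keeps plural and proper nouns
all_nouns = [word for word, tag in pos if tag.startswith("NN")]

# Counter is used here so the snippet runs without NLTK
freq = Counter(only_nn)
print(only_nn)    # ['dog', 'dog', 'park']
print(all_nouns)  # ['dog', 'dogs', 'dog', 'park', 'london']
```

Which filter fits better depends on whether plurals and names should count as separate words in the final ranking.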

Get the three most frequent words


After counting the word frequencies, you can get the top three by calling most_common() with 3 as the argument.
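As a rough illustration of this last step, the sketch below ranks a made-up noun list. nltk.FreqDist is a subclass of collections.Counter, so most_common() is expected to behave the same way on both; Counter is used here so the snippet runs without NLTK, and the word list is an assumption for illustration only.

```python
from collections import Counter

# Made-up list of nouns, standing in for only_nn
only_nn = ["data", "model", "data", "text", "model",
           "data", "word", "text", "model", "data"]

freq = Counter(only_nn)

# most_common(n) returns the n highest counts as (word, count) pairs
top_three = freq.most_common(3)
print(top_three)  # [('data', 4), ('model', 3), ('text', 2)]
```

Calling most_common() without an argument returns every word, ordered from most to least frequent.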
