Shun Yamada
How to extract high-frequency words in NLTK

While reading the official documentation for NLTK (Natural Language Toolkit), I tried extracting the words that are used most frequently in a sample text. This time, I display the three most frequent words.


Install NLTK

$ pip install nltk

Extract high-frequency words

Let's start coding. First, download punkt and averaged_perceptron_tagger, which are needed for word tokenization and part-of-speech tagging. Next, read a sample text and split it into words. Then remove everything that is not a noun from the result. Finally, get the most frequent words.


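import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Import nltk, then download punkt and averaged_perceptron_tagger. Once they are downloaded into your environment, you don't have to download them again. If NLTK raises a LookupError that names a different resource, download the one it names.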

Tokenize the text into words

raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

tokens_l = [w.lower() for w in tokens]

Prepare an essay or some other long text and save it as sample.txt. After reading the file, tokenize it into words. Then convert every token to lower case so that capitalized and lowercase forms of a word are recognized as the same word.
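To see why the lower-casing step matters, here is a minimal sketch. The token list is made up for illustration and only stands in for what nltk.word_tokenize would return.

```python
# Hypothetical tokens standing in for nltk.word_tokenize output.
tokens = ['The', 'cat', 'saw', 'the', 'other', 'Cat', '.']

tokens_l = [w.lower() for w in tokens]

# Before lower-casing, 'The'/'the' and 'cat'/'Cat' count as different words.
distinct_before = len(set(tokens))
distinct_after = len(set(tokens_l))
print(distinct_before, distinct_after)
```

After lower-casing there are fewer distinct tokens, so the later frequency count does not split one word across several spellings.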

Extract only nouns

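# Tag each token with its part of speech (assumes the lowercased tokens are what gets tagged)
pos = nltk.pos_tag(tokens_l)
# Keep only the words tagged 'NN'
only_nn = [x for (x, y) in pos if y == 'NN']

freq = nltk.FreqDist(only_nn)

Remove the non-noun words from the result, then count how often each remaining word occurs. Note that the 'NN' tag covers only singular common nouns; plural and proper nouns carry other tags such as 'NNS' and 'NNP'.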
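If you also want plural and proper nouns, one possible variation is to keep every tag that starts with 'NN'. The sketch below uses hand-written (word, tag) pairs in the shape that nltk.pos_tag returns; they are illustrative, not real tagger output, and the name all_nouns is hypothetical.

```python
# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
pos = [('cats', 'NNS'), ('chase', 'VBP'), ('mouse', 'NN'),
       ('tokyo', 'NNP'), ('quickly', 'RB'), ('mouse', 'NN')]

# Same filter as in the article: singular common nouns only.
only_nn = [word for (word, tag) in pos if tag == 'NN']

# Variation: any noun tag (NN, NNS, NNP, NNPS).
all_nouns = [word for (word, tag) in pos if tag.startswith('NN')]

print(only_nn)
print(all_nouns)
```

Which filter is better depends on the text: the strict one gives a narrower list, while the broad one counts names and plurals as well.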

Get the three most frequent words

After counting the words, you can get the top three with most_common(3).
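As a minimal sketch of this last step, the example below builds a FreqDist from a made-up noun list (a stand-in for only_nn from the previous step, so no NLTK data download is needed) and asks for the three most frequent entries.

```python
import nltk

# Hypothetical noun list standing in for only_nn from the previous step.
only_nn = ['cat', 'dog', 'cat', 'bird', 'cat',
           'dog', 'fish', 'dog', 'cat', 'bird']

freq = nltk.FreqDist(only_nn)

# most_common(3) returns (word, count) pairs, highest count first.
top_three = freq.most_common(3)
print(top_three)
```

Each entry is a (word, count) tuple, so you can loop over top_three to print the words in whatever format you like.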

