Shun Yamada

Posted on Dec 27, 2019

How to extract high-frequency words in NLTK

#python #nltk

While reading the official documentation for NLTK (Natural Language Toolkit), I tried extracting the words that are used most frequently in a sample text. This time, I display the three most frequent words.

Development

Python
NLTK

Install NLTK

$ pip install nltk

Extract High-frequency words

Let's start coding. First, download punkt and averaged_perceptron_tagger, which are needed for word tokenization and part-of-speech tagging. Next, read a sample text and split it into words. Then remove everything that is not a noun from the result. Finally, get the most frequent words.

Download

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

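Import nltk, and then download punkt and averaged_perceptron_tagger. Once they are downloaded in your environment, you don't have to do it again. Newer NLTK releases may ask for punkt_tab and averaged_perceptron_tagger_eng instead; if so, the LookupError message names the resource to download.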

Tokenize the text into words

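# Read the whole sample file; the with-block closes it afterwards.
with open('sample.txt') as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)  # split the raw text into word tokens
text = nltk.Text(tokens)  # optional wrapper for exploring the text; not used below

tokens_l = [w.lower() for w in tokens]  # lowercase every token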

Prepare an essay or another long text as sample.txt. After reading the file, split it into word tokens. Then convert every token to lower case so that capitalized and uncapitalized forms are counted as the same word.
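To see why lowercasing matters for counting, here is a minimal sketch. It uses a hand-written token list (a hypothetical stand-in for the output of nltk.word_tokenize()) and collections.Counter from the standard library, so it runs without any NLTK data.

```python
from collections import Counter

# Hypothetical token list standing in for the output of nltk.word_tokenize().
tokens = ["The", "cat", "saw", "the", "other", "Cat", "."]

# Without normalization, "The"/"the" and "cat"/"Cat" are counted separately.
raw_counts = Counter(tokens)

# Lowercasing first merges each pair of variants into a single entry.
lower_counts = Counter(w.lower() for w in tokens)

print(raw_counts["the"], lower_counts["the"])
print(raw_counts["cat"], lower_counts["cat"])
```

Each word is counted once per spelling before lowercasing and twice after it, which is the effect the lowercasing step is meant to achieve.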

Extract only nouns

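pos = nltk.pos_tag(tokens_l)  # tag each token with its part of speech
only_nn = [x for (x, y) in pos if y == 'NN']  # keep singular common nouns

freq = nltk.FreqDist(only_nn)  # count occurrences of each noun

Tag each token with its part of speech using nltk.pos_tag(), and remove the words that are not tagged as nouns (NN) from the result. Then count how often each remaining word appears.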
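Note that the tag NN covers only singular common nouns in the Penn Treebank tag set; plural nouns (NNS) and proper nouns (NNP, NNPS) have their own tags. The sketch below shows the difference on a hand-written list of (word, tag) pairs, a hypothetical stand-in for the output of nltk.pos_tag(). Whether you want the wider filter depends on your text, so treat it as one option rather than the required approach.

```python
# Hypothetical (word, tag) pairs in the shape returned by nltk.pos_tag().
tagged = [
    ("the", "DT"),
    ("cat", "NN"),
    ("chases", "VBZ"),
    ("mice", "NNS"),
    ("tokyo", "NNP"),
    ("cat", "NN"),
]

# Exact match: keeps singular common nouns only.
only_nn = [word for word, tag in tagged if tag == "NN"]

# Prefix match: also keeps NNS, NNP and NNPS.
all_nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(only_nn)
print(all_nouns)
```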

Get the most frequent three words

print(freq.most_common(3))

After counting the words, you can get the top three with most_common(3), which returns a list of (word, count) pairs ordered from the most frequent to the least.
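nltk.FreqDist is a subclass of collections.Counter, so most_common() behaves the same way on both. As a small sketch, the example below uses Counter directly with a made-up noun list (an assumption for illustration, not taken from a real sample text), which lets you try the call without downloading any NLTK data.

```python
from collections import Counter

# Made-up noun list standing in for only_nn.
nouns = ["data", "model", "data", "text", "model",
         "data", "text", "model", "data", "word"]

# Count each word, just as nltk.FreqDist(nouns) would.
freq = Counter(nouns)

# The three most frequent words as (word, count) pairs.
top_three = freq.most_common(3)
print(top_three)  # [('data', 4), ('model', 3), ('text', 2)]
```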
