First steps in text processing with NLTK: text tokenization and analysis

I have already had the opportunity to talk about NLTK in two of my previous articles (link#1, link#2).

In this article, I would like to review some possibilities of NLTK.

The kinds of examples discussed in articles like this one fall under what is called natural language processing (NLP). We can apply these techniques to different categories of text to obtain very varied results:

  • Automatic summaries
  • Sentiment analysis
  • Keyword extraction for search engines
  • Content recommendation
  • Opinion research (in marketplaces, aggregators, etc...)
  • Offensive language filters
  • ...

NLTK is not the only package in this field. There are alternatives for these types of tasks, such as:

  • Apache OpenNLP.
  • Stanford NLP suite.
  • Gate NLP library.

To start experimenting, the first step is to install NLTK, which is very simple:

pip install nltk

Once the NLTK library is installed, we can download additional packages from the Python interpreter, such as the Punkt sentence tokenizer:

import nltk

nltk.download('punkt')

One of the most important steps before tackling any natural language processing task is text tokenization. This phase is critical: without it, processing the text becomes much more challenging.

Tokenization, also known as text segmentation or lexical analysis, consists of conceptually dividing text or text strings into smaller parts such as sentences, words, or symbols. As a result of the tokenization process, we get a list of tokens.

NLTK includes both a sentence tokenizer and a word tokenizer: a text can be split into sentences, and sentences can be tokenized into words.

We have, for example, this text (from Wikipedia - Stoicism):

para = "Stoicism is a school of Hellenistic philosophy founded by Zeno of Citium in Athens in the early 3rd century BC. It is a philosophy of personal ethics informed by its system of logic and its views on the natural world. According to its teachings, as social beings, the path to eudaimonia (happiness, or blessedness) is found in accepting the moment as it presents itself, by not allowing oneself to be controlled by the desire for pleasure or by the fear of pain, by using one's mind to understand the world and to do one's part in nature's plan, and by working together and treating others fairly and justly."

Tokenizing in Python is simple: we import the sent_tokenize function from NLTK, which returns a list with one token per sentence.

from nltk.tokenize import sent_tokenize

tokenized_l1 = sent_tokenize(para)

print(tokenized_l1)

and we will get the following result:

['Stoicism is a school of Hellenistic philosophy founded by Zeno of Citium in Athens in the early 3rd century BC.', 'It is a philosophy of personal ethics informed by its system of logic and its views on the natural world.', "According to its teachings, as social beings, the path to eudaimonia (happiness, or blessedness) is found in accepting the moment as it presents itself, by not allowing oneself to be controlled by the desire for pleasure or by the fear of pain, by using one's mind to understand the world and to do one's part in nature's plan, and by working together and treating others fairly and justly."]

Likewise, we can tokenize a sentence to obtain a list of words:

from nltk.tokenize import word_tokenize

sentence1 = tokenized_l1[0]

print(word_tokenize(sentence1))

['Stoicism', 'is', 'a', 'school', 'of', 'Hellenistic', 'philosophy', 'founded', 'by', 'Zeno', 'of', 'Citium', 'in', 'Athens', 'in', 'the', 'early', '3rd', 'century', 'BC', '.']

Now let's do something a little closer to a real case: for example, extracting some statistics from an article. We can fetch the content of a web page and then analyze its text to draw some conclusions.

For this, we can use urllib.request to get the HTML content of our target page:

import urllib.request

response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Stoicism')

html = response.read()

print(html)

and use BeautifulSoup, a very useful Python library for extracting data from HTML and XML documents with different levels of filtering and noise removal. We can extract only the text of the page, without HTML markup, by using get_text() or a custom solution like the example code below.

pip install beautifulsoup4
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"html.parser")

title = soup.select("#firstHeading")[0].text

paragraphs = soup.select("p")

intro = '\n'.join([ para.text for para in paragraphs[0:4]])

print(intro)
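If we do not need this paragraph-level control, the get_text() method mentioned above is a simpler (if noisier) option, since it returns all of the page's visible text in a single string. A minimal sketch, reusing the soup object from the previous snippet:

# Quick-and-dirty alternative: all visible text of the page, HTML markup stripped
full_text = soup.get_text(separator=' ', strip=True)

print(full_text[:500])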

Finally, we can go on to convert the text obtained into tokens by dividing the text as described above:

tokens = word_tokenize(intro)

print (tokens)

From here, we can apply different tools to “standardize” our token set. For example, to convert all tokens to lowercase:

new_tokens = []
for token in tokens:
    new_token = token.lower()
    new_tokens.append(new_token)
tokens = new_tokens

...remove punctuation:

import re
new_tokens = []
for token in tokens:
    new_token = re.sub(r'[^\w\s]', '', token)
    if new_token != '':
        new_tokens.append(new_token)
tokens = new_tokens
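If you prefer to avoid regular expressions, str.translate with string.punctuation gives a similar result; note that string.punctuation only covers ASCII punctuation, while the regex above also strips other non-word symbols:

import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
stripped = [token.translate(table) for token in tokens]
tokens = [token for token in stripped if token != '']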

...replace numbers with their textual representation using inflect, a library that generates plurals, singular nouns, ordinals, and indefinite articles, and converts numbers to words:

pip install inflect
import inflect
p = inflect.engine()
new_tokens = []
for token in tokens:
    if token.isdigit():
        new_token = p.number_to_words(token)
        new_tokens.append(new_token)
    else:
        new_tokens.append(token)
tokens = new_tokens
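A quick check of what inflect produces (the exact wording of larger numbers may vary slightly with its configuration):

print(p.number_to_words("3"))     # three
print(p.number_to_words("2023"))  # e.g. two thousand and twenty-three

Note that tokens such as "3rd" do not satisfy isdigit(), so ordinals pass through the loop above unchanged.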

... and remove stopwords, which are words that don't add significant meaning to the text.

nltk.download('stopwords')

from nltk.corpus import stopwords

stopwords.words('english')

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

new_tokens = []
for token in tokens:
    if token not in stopwords.words('english'):
        new_tokens.append(token)
tokens = new_tokens
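A small performance note: the loop above calls stopwords.words('english') once per token, rebuilding the list every time. For longer texts it is worth computing it once, as a set, before filtering:

# Build the stopword set once; membership tests on a set are fast
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]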

Finally, lemmatisation will allow us to reduce each word to its base form (its lemma) and thus ignore inflections (verb conjugations, plurals...):

nltk.download('wordnet')  # the WordNetLemmatizer needs the WordNet corpus

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = []
for token in tokens:
    lemma = lemmatizer.lemmatize(token, pos='v')  # lemmatize every token as a verb
    lemmas.append(lemma)
tokens = lemmas
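Note that we pass pos='v' above, so every token is lemmatized as a verb. When no part of speech is given, WordNetLemmatizer treats words as nouns by default, which can change the result; a quick illustration:

print(lemmatizer.lemmatize('running'))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
print(lemmatizer.lemmatize('books'))             # 'book'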

Tokenization and lemmatization are techniques I used extensively in my latest project.

Once all this standardization is done, we can move on to a simple analysis: for example, calculating the frequency distribution of the tokens with NLTK's FreqDist() function, which does the job nicely:

freq = nltk.FreqDist(tokens)

for key,val in freq.items():

    print (str(key) + ':' + str(val))
material:1
found:2
ethical:1
phrase:2
present:1
others:1
justly:1
teach:2
athens:1
natural:2
especially:1
corruptions:1
health:1
live:1
believe:1
sufficient:1
emotionally:1
misfortune:1
citium:1
moral:1
behave:1
sage:2
mind:1
everything:1
zeno:1
radical:1
epictetusemphasized:1
pain:1
pleasure:1
moment:1
blessedness:1
allow:1
hold:1
philosophy:3
also:1
two:1
know:1
rule:1
adiaphora:1
people:1
fear:1
bc:1
use:1
vicious:1
together:1
accord:1
maintain:1
fairly:1
say:1
prohairesis:1
act:1
ethics:3
form:1
work:1
nature:3
would:1
human:1
social:1
wealth:1
stoicism:1
understand:2
three:1
one:5
mean:1
think:2
system:1
best:1
judgment:1
oneself:1
pleasure:1
accept:1
hellenistic:1
school:1
virtue:4
personal:1
person:2
errors:1
century:1
call:1
equally:1
eudaimonia:1
order:1
free:1
find:1
similar:1
destructive:1
calm:1
plan:1
value:1
alongside:1
indication:1
accordance:1
view:2
certain:1
good:3
seneca:1
bad:1
many:1
result:1
consider:1
root:1
belief:1
resilient:1
path:1
truly:1
control:1
include:1
logic:1
early:1
desire:1
world:2
life:1
inform:1
major:1
happiness:2
part:1
tradition:1
though:1
emotions:1
treat:1
since:1
stoics:3
individual:1
aim:1
aristotelian:1
be:2
upon:1
external:1
stoic:3
approach:1
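If we only want the top of the ranking instead of the full listing, FreqDist also provides most_common():

# List of (token, count) pairs, most frequent first
print(freq.most_common(10))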

And finally, we can graphically represent the result in this way.

pip install matplotlib
freq.plot(20, cumulative=False)

[Frequency plot of the 20 most common tokens]

This first analysis can help us classify a text and determine how we could index or frame the article within a content aggregator.

We can apply more formal techniques to this classification, such as a Naive Bayes classifier. In its simplest form, we use the conditional probabilities of the words in a text to determine which category the text belongs to. The algorithm is called "naive" because it computes each word's conditional probability separately, as if the words were independent of each other. Once we have the conditional probability of each word in a text, we multiply them all together to estimate the likelihood that the text belongs to a given category.

I would like to cover a simple example of applying a Naive Bayes classifier in another article.
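In the meantime, just to give a taste of what this looks like, here is a minimal sketch using NLTK's built-in NaiveBayesClassifier; the toy training sentences and the bag-of-words feature extractor are purely illustrative:

import nltk

# Toy labelled data: (text, category) pairs -- purely illustrative
train_data = [
    ("great product works perfectly", "pos"),
    ("terrible quality broke quickly", "neg"),
    ("really happy with this purchase", "pos"),
    ("waste of money very disappointed", "neg"),
]

def features(text):
    # Bag-of-words features: each word is treated independently ("naively")
    return {word: True for word in text.split()}

train_set = [(features(text), label) for text, label in train_data]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Predict the category of a new, unseen piece of text
print(classifier.classify(features("very happy with the quality")))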

To finish this article, we can tackle an apparently difficult task which becomes easy to achieve in Python, at least as a first approach: sentiment analysis. Using NLTK, we can analyze the sentiment of each sentence in a text. Sentiment analysis is a machine learning technique based on natural language processing, aiming to obtain subjective information from a series of texts or documents.

To do this, we must download the following package:

nltk.download('vader_lexicon')

This package implements VADER (Valence Aware Dictionary for Sentiment Reasoning), a model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity of emotion.

Next, we split the text to be analyzed with the sentence tokenizer seen earlier, obtaining each sentence of the paragraph separately.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

text = "The habit of reading is one of the greatest resources of mankind; and we enjoy reading books that belong to us much more than if they are borrowed. A borrowed book is like a guest in the house; it must be treated with punctiliousness, with a certain considerate formality. You must see that it sustains no damage; it must not suffer while under your roof. You cannot leave it carelessly, you cannot mark it, you cannot turn down the pages, you cannot use it familiarly. And then, some day, although this is seldom done, you really ought to return it..."

tokenized_text = sent_tokenize(text)

(William Lyon Phelps, "The Pleasure of Books", from http://www.historyplace.com/speeches/phelps.htm)

...and finally instantiate the sentiment analyzer and apply it to each sentence.

analyzer = SentimentIntensityAnalyzer()

for sentence in tokenized_text:
    print(sentence)
    scores = analyzer.polarity_scores(sentence)
    for key in scores:
        print(key, ': ', scores[key])
        print()

As a result, we can examine each of the sentences separately.
These are some example results:

The habit of reading is one of the greatest resources of mankind; and we enjoy reading books that belong to us much more than if they are borrowed.
neg :  0.0

pos :  0.222

neu :  0.778

compound :  0.8126

A borrowed book is like a guest in the house; it must be treated with punctiliousness, with a certain considerate formality.
neg :  0.0

pos :  0.333

neu :  0.667

compound :  0.7579

You must see that it sustains no damage; it must not suffer while under your roof.
neg :  0.254

pos :  0.134

neu :  0.612

compound :  -0.3716

You cannot leave it carelessly, you cannot mark it, you cannot turn down the pages, you cannot use it familiarly.
neg :  0.0

pos :  0.138

neu :  0.862

compound :  0.2235

...

And there is no doubt that in these books you see these men at their best.
neg :  0.215

pos :  0.192

neu :  0.594

compound :  0.128

For each sentence, several different scores are obtained, as seen in the output above:

  • neg (negative): indicates how negative the sentence is, as a score between zero and one.
  • neu (neutral): indicates how neutral the sentence is, also between zero and one.
  • pos (positive): same as the previous ones, but indicating how positive the sentence is.
  • compound: a value between -1 and 1 that summarizes whether the sentence is positive or negative. Values close to -1 suggest that it is very negative, values close to zero that it is neutral, and values close to 1 that it is very positive.
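A common convention, suggested by the VADER authors, is to collapse the compound score into a single label, treating values of 0.05 or above as positive and -0.05 or below as negative. A minimal sketch, reusing the analyzer and tokenized_text from above:

def label_sentence(sentence):
    # Map the compound score to a coarse polarity label
    compound = analyzer.polarity_scores(sentence)['compound']
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

for sentence in tokenized_text:
    print(label_sentence(sentence), '->', sentence)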

If you are interested in this field, you can probably find texts better suited to sentiment analysis, such as political opinions or product reviews... this is just an example.

Thanks for reading this article. If you have any questions, feel free to comment below.

Connect with me on Twitter or LinkedIn
