NLTK is a sophisticated library. Under continuous development since 2001, it supports all classical NLP tasks, from tokenization, stemming, and part-of-speech tagging to semantic indexing and dependency parsing. It also has a rich set of additional features, such as built-in corpora, different models for its NLP tasks, and integration with scikit-learn and other Python libraries.
This article is a concise introduction to NLTK. You will see NLTK in action, with short code snippets that you can use for a variety of NLP tasks.
This article originally appeared at my blog admantium.com.
The technical context of this article is Python v3.11 and NLTK v3.8.1. All examples should work with newer versions too.
NLTK Library Installation
NLTK can be installed via Python pip:
python3 -m pip install nltk
Several NLTK features require additional data, such as stop word lists or built-in corpora. This data is fetched with the built-in downloader. Here is an example:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('reuters')
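Calling the downloader on every run is harmless but noisy. The following is a minimal sketch of a helper that downloads a package only when it is missing. The name `ensure_nltk_resource` and its `downloader` parameter are assumptions made for this illustration and are not part of NLTK; only `nltk.data.find` and `nltk.download` are library calls.

```python
import nltk

def ensure_nltk_resource(resource_path, package, downloader=nltk.download):
    """Download `package` only if `resource_path` cannot be found locally.

    Hypothetical helper for illustration; returns True when a download
    was triggered and False when the resource was already present.
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find(resource_path)
        return False
    except LookupError:
        downloader(package)
        return True

# Usage (needs network access the first time):
# ensure_nltk_resource('corpora/stopwords', 'stopwords')
# ensure_nltk_resource('corpora/reuters', 'reuters')
```

Passing the downloader as a parameter is a design choice that keeps the helper easy to try out without network access.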
Other parts, like specialized tokenizers or taggers, require Java libraries to be installed. See this GitHub Gist to get started.
NLP Tasks
NLTK supports several NLP tasks. Here is a short overview, and the next sections provide more details:
- Text Processing
  - Tokenization
  - Stemming
  - Lemmatization
- Text Syntax
  - Part-of-Speech Tagging
- Text Semantics
  - Named Entity Recognition
- Document Semantics
  - Clustering
  - Classification
Furthermore, NLTK supports these additional features:
- Datasets
- Corpus Management
- Machine Learning Clustering and Classification Models
Text Processing
Tokenization
Tokenizing is an essential first step in text processing. In general, the tokenization approach should be chosen depending on project requirements and subsequent NLP tasks. For example, when a text contains multi-word nouns that represent entities or persons, but the tokenizer just splits on whitespace, named entity recognition becomes hard.
NLTK provides a simple whitespace tokenizer, several built-in tokenizers, such as NIST or Stanford, and options for custom tokenizers based on regular expressions.
Here is an example of the built-in sentence and word tokenizer:
from nltk.tokenize import sent_tokenize, word_tokenize
# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''
sentences = []
for sent in sent_tokenize(paragraph):
    sentences.append(word_tokenize(sent))
sentences[0]
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline'
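The whitespace and regular-expression tokenizers mentioned above need no downloaded data. Here is a small sketch that contrasts them on a made-up sentence; the sample text and the pattern `\w+` are assumptions chosen for illustration.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

text = "Ken Jennings lost to IBM's Watson in 2011."

# Splits on whitespace only, so punctuation stays attached to words
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)
# ['Ken', 'Jennings', 'lost', 'to', "IBM's", 'Watson', 'in', '2011.']

# Keeps only runs of word characters, so punctuation is dropped
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(text)
print(regexp_tokens)
# ['Ken', 'Jennings', 'lost', 'to', 'IBM', 's', 'Watson', 'in', '2011']
```

Neither variant keeps "Ken Jennings" together as one token, which illustrates why the tokenizer choice matters for later steps such as named entity recognition.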
Stemming and Lemmatization
Like tokenization, the choice of a suitable stemming approach (replacing inflected words with their word stem, such as cooking with cook) and lemmatization approach (replacing inflected word forms with their lemma) depends on the subsequent NLP tasks. Lemmatization has a special role because it requires part-of-speech tagging or word sense disambiguation to correctly identify the lemma.
NLTK provides several stemmer modules, such as Porter, Lancaster and ISRI. For lemmatization, only WordNet is provided.
Let's compare stemming and lemmatization of the first sentence in the Wikipedia article about artificial intelligence.
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize
sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'
stemmer = LancasterStemmer()
stemmed_sent = [stemmer.stem(word) for word in word_tokenize(sent)]
print(stemmed_sent)
# ['art', 'intellig', 'was', 'found', 'as', 'an', 'academ', 'disciplin',
And the same sentence processed with the WordNet lemmatizer:
# Requires the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'
lemmatizer = WordNetLemmatizer()
# Without a pos argument every token is treated as a noun, hence 'wa' for 'was'; pass pos='v' for verbs
lemmas = [lemmatizer.lemmatize(word) for word in word_tokenize(sent)]
print(lemmas)
# ['Artificial', 'intelligence', 'wa', 'founded', 'a', 'an', 'academic', 'discipline'
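Because NLTK ships several stemmers, it can help to run two of them side by side before committing to one. The sketch below compares the Porter and Lancaster stemmers on a few words taken from the sample sentence plus the cooking example; the word list is an arbitrary choice for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

words = ["cooking", "approaches", "funding", "renewed"]

porter = PorterStemmer()
lancaster = LancasterStemmer()

porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

print(porter_stems)
# ['cook', 'approach', 'fund', 'renew']

# Print a small comparison table: word, Porter stem, Lancaster stem
for word, p_stem, l_stem in zip(words, porter_stems, lancaster_stems):
    print(f"{word:12} {p_stem:12} {l_stem}")
```

The Lancaster stemmer is generally the more aggressive of the two, as the stems 'art' and 'intellig' in the output above suggest.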
Text Syntax
Part-of-Speech Tagging
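NLTK also provides different part-of-speech (POS) taggers. By default, the built-in tagger returns fine-grained Penn Treebank tags such as JJ or NN. When called with tagset='universal', it produces the following coarse annotations: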
Tag | Meaning |
---|---|
ADJ | adjective |
ADP | adposition |
ADV | adverb |
CONJ | conjunction |
DET | determiner, article |
NOUN | noun |
NUM | numeral |
PRT | particle |
PRON | pronoun |
VERB | verb |
. | punctuation marks |
X | other |
Taking the first sentence from the Wikipedia article about artificial intelligence, part of speech tagging produces the following result.
from nltk import pos_tag
from nltk.tokenize import word_tokenize
sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'
pos_tag(word_tokenize(sent))
# [('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),
# ('as', 'IN'),
# ('an', 'DT'),
# ('academic', 'JJ'),
# ('discipline', 'NN'),
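The tagger output is a plain list of (token, tag) tuples, so it can be post-processed with ordinary Python. As a minimal sketch, the following filters tokens by tag prefix; the helper name `tokens_with_tag_prefix` is made up for this example, and the tagged tokens are copied from the output above.

```python
# Tagged tokens copied from the pos_tag output shown above (Penn Treebank tags)
tagged = [
    ('Artificial', 'JJ'), ('intelligence', 'NN'), ('was', 'VBD'),
    ('founded', 'VBN'), ('as', 'IN'), ('an', 'DT'),
    ('academic', 'JJ'), ('discipline', 'NN'),
]

def tokens_with_tag_prefix(tagged_tokens, prefix):
    """Return all tokens whose tag starts with `prefix` (hypothetical helper)."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]

nouns = tokens_with_tag_prefix(tagged, 'NN')  # NN, NNS, NNP, ...
verbs = tokens_with_tag_prefix(tagged, 'VB')  # VB, VBD, VBN, ...

print(nouns)
# ['intelligence', 'discipline']
print(verbs)
# ['was', 'founded']
```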
Other NLTK POS taggers need extra setup: the Stanford tagger requires external Java libraries to be downloaded, and the Brill tagger has to be trained before use.
Text Semantics
Named Entity Recognition
NLTK includes pre-trained NER taggers, but several additional packages need to be downloaded first.
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
The NER tagger consumes a POS tagged sentence and adds the classification labels to the representation. Using it on the sample paragraph yields no results, so the following example takes another sentence from the Wikipedia article in which persons are mentioned.
from nltk.tokenize import word_tokenize
# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
sentence= '''
In 2011, in a Jeopardy! quiz show exhibition match, IBM's question answering system, Watson, defeated the two greatest Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin.
'''
tagged_sentence = nltk.pos_tag(word_tokenize(sentence))
tagged_sentence
# [('In', 'IN'),
# ('2011', 'CD'),
# (',', ','),
# ('in', 'IN'),
# ('a', 'DT'),
# ('Jeopardy', 'NN'),
print(nltk.ne_chunk(tagged_sentence))
# (S
# In/IN
# 2011/CD
# ,/,
# in/IN
# a/DT
# Jeopardy/NN
# !/.
# quiz/NN
# show/NN
# exhibition/NN
# match/NN
# ,/,
# (ORGANIZATION IBM/NNP)
# 's/POS
# question/NN
# answering/NN
# system/NN
# ,/,
# (PERSON Watson/NNP)
As you see, the person Watson and the organization IBM are recognized.
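The result of ne_chunk is an nltk.Tree in which each recognized entity is a labeled subtree. The sketch below shows one way to collect the entities as (label, text) pairs. To stay runnable without the downloaded models, it rebuilds a shortened version of the tree above by hand; the function name `extract_entities` is an assumption for this example.

```python
from nltk.tree import Tree

# Hand-built stand-in for part of the ne_chunk output shown above
chunked = Tree("S", [
    ("In", "IN"), ("2011", "CD"), (",", ","),
    Tree("ORGANIZATION", [("IBM", "NNP")]),
    ("'s", "POS"), ("question", "NN"), ("answering", "NN"),
    ("system", "NN"), (",", ","),
    Tree("PERSON", [("Watson", "NNP")]),
])

def extract_entities(tree):
    """Collect (label, text) pairs from the entity subtrees of a chunk tree."""
    entities = []
    for child in tree:
        # Plain tokens are (word, tag) tuples; entities are nested Tree objects
        if isinstance(child, Tree):
            text = " ".join(token for token, tag in child.leaves())
            entities.append((child.label(), text))
    return entities

entities = extract_entities(chunked)
print(entities)
# [('ORGANIZATION', 'IBM'), ('PERSON', 'Watson')]
```

The same function can be applied directly to the return value of nltk.ne_chunk(tagged_sentence).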
Document Semantics
Clustering
Three clustering algorithms are supported (see the complete documentation):
- K-Means
- EM Cluster
- Group Average Agglomerative Clusterer (GAAC)
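As a rough sketch of the clustering API, the following runs K-Means on four hand-made 2D vectors. It assumes NumPy is installed, which NLTK's cluster module relies on. The vectors and the fixed initial means are invented for this example so that the run is deterministic; real text would first have to be converted into numeric feature vectors.

```python
import numpy
from nltk.cluster import KMeansClusterer, euclidean_distance

# Two obvious groups: points near (1, 1) and points near (8, 8)
vectors = [numpy.array(v) for v in [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]]]

# Fixed starting means (an assumption for reproducibility);
# without them the clusterer picks random starting points
initial_means = [numpy.array([0.0, 0.0]), numpy.array([10.0, 10.0])]

clusterer = KMeansClusterer(2, euclidean_distance, initial_means=initial_means)

# assign_clusters=True returns one cluster index per input vector
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)

# New vectors can be assigned to the learned clusters afterwards
new_assignment = clusterer.classify(numpy.array([7.5, 8.2]))
print(new_assignment)
```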
Classification
The following classifiers are implemented in NLTK (see also the complete documentation):
- Decision Tree
- Maximum Entropy Modelling
- Megam maxent optimization
- Naive Bayes (and variants)
External packages, like TextCat for language identification, the Java library Weka, or scikit-learn classifiers, are supported as well.
Additional Features
Datasets
NLTK provides more than 100 built-in corpora (see the complete list). Some examples: Reuters news articles, the Treebank 2 Wall Street Journal corpus, Twitter samples, or the WordNet lexical database.
Here is an example of how to access the Reuters corpus.
from nltk.corpus import reuters
print(reuters.categories()[:10])
#['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee']
print(reuters.fileids()[:10])
# ['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']
sample = 'test/14829'
categories = reuters.categories(sample)
print(categories)
# ['crude', 'nat-gas']
content = ""
with reuters.open(sample) as stream:
    content = stream.read()
print(f"Categories {categories} / file {sample}")
# Categories ['crude', 'nat-gas'] / file test/14829
print(f"Content:\n{content}")
# Content:
# JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS
# The Ministry of International Trade and
# Industry (MITI) will revise its long-term energy supply/demand
# outlook by August to meet a forecast downtrend in Japanese
# energy demand, ministry officials said.
# MITI is expected to lower the projection for primary energy
# supplies in the year 2000 to 550 mln kilolitres (kl) from 600
# mln, they said.
# The decision follows the emergence of structural changes in
# Japanese industry following the rise in the value of the yen
# and a decline in domestic electric power demand.
# MITI is planning to work out a revised energy supply/demand
# outlook through deliberations of committee meetings of the
# Agency of Natural Resources and Energy, the officials said.
# They said MITI will also review the breakdown of energy
# supply sources, including oil, nuclear, coal and natural gas.
# Nuclear energy provided the bulk of Japan's electric power
# in the fiscal year ended March 31, supplying an estimated 27
# pct on a kilowatt/hour basis, followed by oil (23 pct) and
# liquefied natural gas (21 pct), they noted.
Corpus Management
Corpus Reader
NLTK's corpus reader objects support reading, filtering, decoding, and preprocessing of structured file lists or zip files.
Many different corpus reader objects exist (see the full list). The most common readers are:
- PlaintextCorpusReader: Reads text documents in which paragraphs are separated by blank lines.
- Markdown: Processes Markdown files whose categories are encoded in the file names
- Tagged: Special corpus readers that expect an already tagged corpus, such as CoNLL. Note that tagged versions already exist for several built-in corpora.
- Twitter: Process tweets in JSON format
- XML: Process XML files
As a short example, here is a PlaintextCorpusReader that will read *.txt files.
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
print(corpus.fileids())
# ['AI_alignment.txt', 'AI_safety.txt', 'Artificial_intelligence.txt', 'Machine_learning.txt']
print(corpus.sents())
# [['In', 'the', 'field', 'of', 'artificial', 'intelligence', '(', 'AI', '),', 'AI', 'alignment', 'research', 'aims', 'to', 'steer', 'AI', 'systems', 'towards', 'humans', '’', 'intended', 'goals', ',', 'preferences', ',', 'or', 'ethical', 'principles', '.'], ['An', 'AI', 'system', 'is', 'considered', 'aligned', 'if', 'it', 'advances', 'the', 'intended', 'objectives', '.'], ...]
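To try the reader without an existing wikipedia_articles directory, the following sketch writes two tiny files into a temporary directory and reads them back. The file names and contents are invented for illustration. It only calls fileids() and words(), which work without downloaded tokenizer models, whereas sents() needs the punkt data.

```python
import os
import tempfile
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Made-up sample documents for this example
documents = {
    "ai_safety.txt": "AI safety studies risks.",
    "machine_learning.txt": "Machine learning finds patterns in data.",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in documents.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    corpus = PlaintextCorpusReader(root, r".*\.txt")
    file_ids = corpus.fileids()
    # Materialize the lazy corpus views before the directory is removed
    words_per_file = {fid: list(corpus.words(fid)) for fid in file_ids}

print(file_ids)
# ['ai_safety.txt', 'machine_learning.txt']
print(words_per_file["ai_safety.txt"])
# ['AI', 'safety', 'studies', 'risks', '.']
```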
Text Collection
Another utility to access structured information from a corpus is the TextCollection class. Instantiated on tokenized texts, it provides the following functions:
- collocations(num, window_size): Prints up to num collocations, considering words that appear within a window of window_size tokens
- collocation_list(num, window_size): Returns the collocated words as a list of tuples
- common_contexts(words, num): Prints the contexts that the given words share
- concordance(word, width, lines): Prints the concordance for the given word (an individual word or a sequence of words)
- concordance_list(word, width, lines): Returns the concordance as a list
- count(word): Absolute number of appearances of the word
- tf, idf, tf_idf: Term frequency, inverse document frequency, and their product
- generate: Creates random text based on a trigram language model
- vocab: Frequency distribution of all tokens
- plot: Draws the frequency distribution
Here is an example:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.text import TextCollection
corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
col = TextCollection(corpus.sents())
print(col.count('the'))
# 973
col.common_contexts(['intelligence'])  # prints the contexts itself and returns None
# artificial_( general_( artificial_. artificial_is general_,
# artificial_, artificial_in artificial_". artificial_and "_"
# artificial_was general_and general_. artificial_; artificial_" of_or
# artificial_– artificial_to artificial_: and_.
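The tf, idf and tf_idf methods can be tried on hand-made token lists without any corpus files. The three toy texts below are invented for this sketch. Note that every list passed to TextCollection counts as one document for the idf computation.

```python
import math
from nltk.text import TextCollection

# Three toy documents, already tokenized
texts = [
    ["ai", "safety", "research"],
    ["ai", "alignment", "research"],
    ["machine", "learning"],
]
collection = TextCollection(texts)

# Term frequency: occurrences of the term divided by the document length
tf = collection.tf("ai", texts[0])

# Inverse document frequency: natural log of (documents / documents containing the term)
idf = collection.idf("machine")

# Product of the two for a term in a given document
tf_idf = collection.tf_idf("ai", texts[0])

print(tf, idf, tf_idf)
print(math.isclose(idf, math.log(3)))
# True
```

Since "machine" appears in one of three documents, its idf is ln(3), while "ai" appears in two documents and therefore gets the lower weight ln(3/2).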
Machine Learning Clustering and Classification Models
NLTK provides several clustering and classification algorithms. But before using any algorithm, features need to be manually designed and extracted from the texts.
On the API documentation page about classification, the steps are defined as follows:
- Define the features that are relevant to the ML task
- Implement methods that extract the features from the corpora (e.g. word frequency from documents)
- Create a list of tuples of the form (feature_dictionary, label), in which each Python dictionary maps feature names to values, and pass it to the training algorithm
Let’s illustrate this with an example from the NLTK Handbook to build a text classifier.
First, we build a vocabulary of all words:
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
# Frequency distribution of all lowercased tokens in the corpus
all_words = nltk.FreqDist(w.lower() for w in corpus.words())
# FreqDist({'the': 65590, ',': 63310, '.': 52247, 'of': 39000, 'and': 30868, 'a': 30130, 'to': 27881, 'in': 24501, '-': 19867, '(': 18243, ...})
# Every distinct token becomes a candidate feature
word_features = list(all_words)
# ['the', ':']
Second, we define a method that returns the one-hot-encoded word vector that expresses whether a word is present in the document or not. The resulting feature vector must contain boolean values in order to be usable for classification tasks.
def document_features(document):
    # Lowercase the tokens so that they match the lowercased vocabulary
    document_words = set(w.lower() for w in corpus.words(document))
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
f = document_features('Artificial_intelligence.txt')
# {'contains(the)': True,
# 'contains(,)': True,
# 'contains(.)': True,
Third, we select a classification algorithm and pass the featurized documents to it.
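# Each document is labeled with its own file ID here, purely for illustration
featuresets = [(document_features(d), d) for d in corpus.fileids()]
# Split as in the NLTK book: the first 100 documents are held out for testing.
# This assumes a corpus with well over 100 documents; with fewer, train_set is empty.
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)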
# <nltk.classify.naivebayes.NaiveBayesClassifier at 0x185ec5dd0>
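To see the three steps end to end without a corpus on disk, here is a self-contained sketch that trains a Naive Bayes classifier on six invented one-line documents with two made-up labels. The helper `word_presence_features`, the texts and the labels are all assumptions for illustration; only NaiveBayesClassifier.train and classify are NLTK calls.

```python
import nltk

# Invented training data: (text, label) pairs
labeled_texts = [
    ("the model learns from training data", "ml"),
    ("neural networks learn from labeled data", "ml"),
    ("training a model requires data", "ml"),
    ("the match ended with a late goal", "sport"),
    ("the team scored a goal in the match", "sport"),
    ("the striker scored for the team", "sport"),
]

# Step 1: the vocabulary defines the features
vocabulary = sorted({word for text, _ in labeled_texts for word in text.split()})

# Step 2: extract boolean word-presence features from a text
def word_presence_features(text, vocabulary):
    tokens = set(text.lower().split())
    return {f"contains({word})": (word in tokens) for word in vocabulary}

# Step 3: build (feature_dictionary, label) tuples and train
train_set = [(word_presence_features(text, vocabulary), label)
             for text, label in labeled_texts]
classifier = nltk.NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(
    word_presence_features("the team won the match", vocabulary))
print(prediction)
# sport
```

With so little data the result only demonstrates the mechanics; a realistic setup would hold out a test set and measure accuracy on it.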
Summary
NLTK is a versatile library that supports several NLP tasks. For the core tasks of tokenizing, stemming/lemmatization, and part-of-speech tagging, built-in methods as well as methods from scientific papers are included. For managing a corpus of documents, NLTK handles Text, Markdown, XML and other formats, and provides an API to fetch files, categories, sentences and words. Especially helpful is the TextCollection class, which enables gathering word collocations and computing term frequencies. Finally, NLTK also offers clustering and classification algorithms such as KMeans, Decision Trees or Naive Bayes.