Purity-Nyagweth

Stemming vs Lemmatization - What is the difference?

Introduction

Stemming and lemmatization are text-processing techniques. In Natural Language Processing (NLP), text is normalized before further processing. The aim of text normalization is to reduce the amount of information that a machine has to handle, which improves the efficiency of the machine learning process.
Both stemming and lemmatization reduce the inflectional forms of words to their root forms. Inflectional forms are words derived from the root or base form of a word. For example, jumped, jumping and jumps are inflectional forms of the root word jump. Likewise, creating, created and creates are inflectional forms of the root word create.

Prerequisites

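  • Basic knowledge of Python programming
  • Python installed
  • Natural Language Toolkit (nltk) package installed, together with the WordNet corpus and the part-of-speech tagger model that the lemmatization code relies on (both can be fetched with nltk.download())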

What is the difference between stemming and lemmatization?

The main difference is that stemming chops suffixes off a word by rule to reduce it to a root form, which need not be a real word. Lemmatization first considers the context of the word, in particular its part of speech, and uses that context to convert the word to its meaningful base form, known as the lemma.
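
As a rough illustration of that contrast, here is a toy sketch. It is neither the Porter algorithm nor WordNet: naive_stem, toy_lemmatize, SUFFIXES and TOY_LEMMAS are hypothetical names, and the lookup table is hand-written for three words only.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.
SUFFIXES = ("ies", "ed", "ing", "s")  # checked in this order

def naive_stem(word):
    """Chop off the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A hand-written stand-in for a dictionary, keyed by (word, part of speech).
TOY_LEMMAS = {
    ("mysteries", "n"): "mystery",
    ("created", "v"): "create",
    ("took", "v"): "take",
}

def toy_lemmatize(word, pos):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("mysteries", "n"), ("created", "v"), ("took", "v")]:
    print(word, naive_stem(word), toy_lemmatize(word, pos))
```

The suffix stripper never recovers take from took, because no suffix rule connects the two forms; only a lookup that knows the word can do that.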

Below are examples of words before and after stemming and lemmatization.

Stemming Examples

Word --- Porter Stemmer

  • jumped --- jump
  • friends --- friend
  • football --- footbal
  • mysteries --- mysteri
  • created --- creat
  • took --- took

Lemmatization Examples

Word --- Lemmatized word

  • jumped --- jump
  • friends --- friend
  • football --- football
  • mysteries --- mystery
  • created --- create
  • took --- take
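
The two lists can also be compared programmatically. The snippet below is only a sketch: it hard-codes the values from the lists above instead of calling nltk, and the variable names are illustrative.

```python
# Values copied from the two example lists above.
porter_stems = {
    "jumped": "jump",
    "friends": "friend",
    "football": "footbal",
    "mysteries": "mysteri",
    "created": "creat",
    "took": "took",
}
lemmas = {
    "jumped": "jump",
    "friends": "friend",
    "football": "football",
    "mysteries": "mystery",
    "created": "create",
    "took": "take",
}

# Words on which stemming and lemmatization disagree.
disagreements = [word for word in porter_stems if porter_stems[word] != lemmas[word]]

for word in disagreements:
    print(f"{word}: stem={porter_stems[word]}, lemma={lemmas[word]}")
```

In three of the four disagreements (footbal, mysteri, creat) the stem is not an English word, while every lemma is.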

How to carry out stemming

The Natural Language Toolkit (nltk) package ships several stemmers for the English language, including PorterStemmer and LancasterStemmer.
We are going to use PorterStemmer to carry out stemming.

First, let's import PorterStemmer.

from nltk.stem import PorterStemmer

Let's now create a list of words that we want to stem

word_list = ["jumped", "friendship", "friends", "swimming","creation","stability","writing",
             "realize","mystery","football", "mysteries", "created", "took"]

We will now stem every word in the list and then print the word with its stemmed version.

stemmer = PorterStemmer()

for word in word_list:
    print((word,stemmer.stem(word)))

Output
('jumped', 'jump')
('friendship', 'friendship')
('friends', 'friend')
('swimming', 'swim')
('creation', 'creation')
('stability', 'stabil')
('writing', 'write')
('realize', 'realiz')
('mystery', 'mysteri')
('football', 'footbal')
('mysteries', 'mysteri')
('created', 'creat')
('took', 'took')
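
One practical consequence of this output is that different forms of a word collapse onto the same stem: mystery and mysteries both become mysteri. A small sketch of grouping words by stem follows. group_by_stem is a hypothetical helper, and the stems are copied from the output above so that the snippet runs without nltk.

```python
from collections import defaultdict

def group_by_stem(words, stem):
    """Group words under the value returned by stem(word)."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)

# Stems copied from the Porter output above (a subset of the word list).
known_stems = {
    "mystery": "mysteri",
    "mysteries": "mysteri",
    "friends": "friend",
    "friendship": "friendship",
}

groups = group_by_stem(known_stems, known_stems.get)
print(groups)
```

With the stemmer created earlier, group_by_stem(word_list, stemmer.stem) should apply the same grouping to the full list. Note that friends and friendship stay in separate groups, because the stemmer leaves friendship unchanged.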

How to carry out lemmatization

As mentioned earlier, lemmatization, like stemming, reduces a word to its root form. For lemmatization, however, we first need to tag the words with their part-of-speech (POS) tags. For example, every verb is given the tag verb (v), every noun is given the tag noun (n), and so on.

Let's first import the libraries that we will be using.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

As a start, let's create a function for tagging the words. We will use nltk's pos_tag function, which labels each word with a Penn Treebank tag such as NN or VBD.

def tag(doc):
    #POS tagging
    tagged_tokens = nltk.pos_tag(doc)
    return tagged_tokens

Next, let's create a function for converting the part-of-speech (POS) tags into the tags that the WordNet lemmatizer expects.

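# function for converting Penn Treebank tags to WordNet tags
def pos_tag_wordnet(tagged_tokens):
    # the first letter of a Penn Treebank tag identifies the word class
    tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
    # tags with no entry in tag_map fall back to noun
    new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN))
                            for word, tag in tagged_tokens]
    return new_tagged_tokens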
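
To see what this mapping does without running the tagger, here is a minimal sketch with WordNet's constants written out as plain strings (in nltk, wordnet.ADJ is 'a', wordnet.VERB is 'v', wordnet.NOUN is 'n' and wordnet.ADV is 'r'). to_wordnet_pos and TAG_MAP are hypothetical names, and the sample tags are hand-written Penn Treebank tags, not tagger output.

```python
# WordNet's part-of-speech constants, written out as plain strings.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"
TAG_MAP = {"j": ADJ, "v": VERB, "n": NOUN, "r": ADV}

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet tag, defaulting to noun."""
    return TAG_MAP.get(treebank_tag[0].lower(), NOUN)

# Hand-written sample tags: past-tense verb, plural noun, adjective,
# adverb, and a preposition (which has no WordNet equivalent).
for treebank_tag in ["VBD", "NNS", "JJ", "RB", "IN"]:
    print(treebank_tag, "->", to_wordnet_pos(treebank_tag))
```

Only the first letter of the tag matters here, so VB, VBD and VBG all map to v, and anything that is not an adjective, verb, noun or adverb is treated as a noun.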

Let's now tag the words in the word list from before, then convert the tags and print the output.

# tag the words
tagged_tokens = tag(word_list)
# convert the tags
wordnet_tokens = pos_tag_wordnet(tagged_tokens)
print(wordnet_tokens)

Output
[('jumped', 'v'), ('friendship', 'n'), ('friends', 'n'), ('swimming', 'v'), ('creation', 'n'), ('stability', 'n'), ('writing', 'v'), ('realize', 'v'), ('mystery', 'n'), ('football', 'n'), ('mysteries', 'n'), ('created', 'v'), ('took', 'v')]
From the output, we can see we've got verbs(v) and nouns(n).

Let's now lemmatize the tagged words.

wnl = WordNetLemmatizer()

# use pos as the loop variable so the tag() function defined earlier is not overwritten
for word, pos in wordnet_tokens:
    print((word, wnl.lemmatize(word, pos)))

Output
('jumped', 'jump')
('friendship', 'friendship')
('friends', 'friend')
('swimming', 'swim')
('creation', 'creation')
('stability', 'stability')
('writing', 'write')
('realize', 'realize')
('mystery', 'mystery')
('football', 'football')
('mysteries', 'mystery')
('created', 'create')
('took', 'take')

Conclusion

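In this article, we've learned what stemming and lemmatization are and how they differ. Both are useful techniques for text normalization, and each has trade-offs: stemming needs no part-of-speech tags but can produce stems that are not real words, such as footbal and mysteri, while lemmatization returns dictionary forms, such as take for took, but needs the words to be tagged first.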
