Lemmatization in NLP


Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma, by considering its context, part of speech, and linguistic rules.

Unlike stemming, which crudely chops off suffixes, lemmatization uses a vocabulary and morphological analysis to return valid dictionary words.
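To make the difference concrete, here is a toy sketch in plain Python. It is not how NLTK works internally: the suffix list, the tiny lookup table, and the function names toy_stem and toy_lemmatize are all made up for illustration.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list, checked in order

# Hypothetical mini-vocabulary mapping inflected forms to their lemma.
LEMMA_TABLE = {"studies": "study", "studied": "study", "studying": "study"}


def toy_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ("studies", "studied", "studying"):
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The suffix stripper turns "studies" and "studied" into "studi", which is not a word, while the lookup returns "study" for all three forms.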

Stemming vs Lemmatization

Let’s explore a quick example using NLTK:


import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data the lemmatizer needs

stem = PorterStemmer()
lem = WordNetLemmatizer()

print(stem.stem('change'))    # chang
print(stem.stem('changes'))   # chang
print(stem.stem('changed'))   # chang

Output:
chang
chang
chang

As you can see, the stemmer removes suffixes but doesn’t care if the result is a valid word. Here, it reduces all forms to "chang", which isn't a real English word.

Now let’s try lemmatization:

print(lem.lemmatize('change'))    # change
print(lem.lemmatize('changes'))   # change
print(lem.lemmatize('changed'))   # changed (still unchanged!)

Output:
change
change
changed

Wait, why didn’t "changed" reduce to "change"? Because the lemmatizer treats every word as a noun unless we specify the part of speech (POS). Read as a noun, "changed" has no shorter base form in WordNet, so it comes back as is.
