Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma, by considering its context, part of speech, and linguistic rules.
Unlike stemming, which crudely chops off suffixes, lemmatization uses a vocabulary and morphological analysis to return valid dictionary words.
Stemming vs Lemmatization
Let’s explore a quick example using NLTK:
from nltk.stem import WordNetLemmatizer, PorterStemmer

stem = PorterStemmer()
lem = WordNetLemmatizer()

print(stem.stem('change'))   # chang
print(stem.stem('changes'))  # chang
print(stem.stem('changed'))  # chang

Output:
chang
chang
chang
As you can see, the stemmer removes suffixes but doesn’t care if the result is a valid word. Here, it reduces all forms to "chang", which isn't a real English word.
Now let’s try lemmatization:
print(lem.lemmatize('change'))   # change
print(lem.lemmatize('changes'))  # change
print(lem.lemmatize('changed'))  # changed (still unchanged!)

Output:
change
change
changed
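Wait, why didn’t "changed" reduce to "change"? That’s because lemmatize() treats every word as a noun (its pos argument defaults to 'n') unless we specify the part of speech (POS). None of the noun rules apply to "changed", so it comes back as is.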