Stemming in NLP


Stemming is the process of reducing a word to its root or base form by chopping off affixes (most often suffixes) according to heuristic rules. It does not take context into account, and the result is not guaranteed to be a valid word.
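To make "heuristic rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list and the toy_stem function are hypothetical, invented for this illustration, and are far cruder than the rules a real stemmer uses.

```python
# Hypothetical toy rules: (suffix, replacement), tried in order.
# These are illustrative only and are NOT the rules used by any real stemmer.
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("ies", "i"), ("s", "")]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applied


for w in ["jumping", "jumped", "jumps", "studies", "sing"]:
    print(w, "->", toy_stem(w))
# jumping -> jump
# jumped -> jump
# jumps -> jump
# studies -> studi
# sing -> sing
```

Note that "studi" is not a dictionary word, and that "sing" survives only because of the minimum-length guard. Both are typical of rule-based stemming: the rules look at letters, not meaning.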

For example, in search engines or digital libraries, a user might search for "investing," but expect results for all related forms like "invest," "investment," or "invested." Stemming helps retrieve such related results by reducing these variations to a common stem.
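The search scenario can be sketched with a tiny inverted index keyed by stems. This is only an illustration: the documents, the index variable, and the search function are made up for the example, tokenization is a plain whitespace split, and it assumes NLTK is installed.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus, invented for this illustration.
documents = [
    "How to invest in index funds",
    "She invested early and retired at fifty",
    "Long-term investment strategies",
    "Gardening tips for spring",
]

# Map each stem to the set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in enumerate(documents):
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)


def search(query):
    """Return sorted ids of documents that share a stem with a one-word query."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


print(search("investing"))  # [0, 1, 2]
```

Because the query and the documents pass through the same stemmer, "investing" matches documents containing "invest," "invested," and "investment," while the unrelated gardening document is left out.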

Inflected Words
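Inflected words change form to express grammatical features such as tense, number, case, or gender.
Examples:

run, runs, running, ran

All are forms of the same word, run, used in different grammatical contexts. A word like runner is a derived form instead: the suffix -er creates a new word rather than a new grammatical form of run.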
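As a quick check of how a rule-based stemmer treats forms like these, the sketch below runs NLTK's PorterStemmer over a small word list chosen for this example. It assumes NLTK is installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Word list chosen for illustration: regular forms, an irregular past tense,
# and a derived noun.
for word in ["run", "runs", "running", "ran", "runner"]:
    print(f"{word:8} -> {stemmer.stem(word)}")
# run      -> run
# runs     -> run
# running  -> run
# ran      -> ran
# runner   -> runner
```

The regular forms collapse to run, but the irregular form ran is left alone because no suffix rule applies to it, and runner also keeps its -er. Stemming works on spelling patterns, so it only groups forms that those patterns can reach.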

Morphology

Morphology is a branch of linguistics that studies how words are formed from smaller units called morphemes.

Root Morpheme: Carries the core meaning.
Examples: run, talk
Affixes: Attach to the root and modify its meaning or grammatical role.
Examples: re- (prefix), -ing, -ed, -er (suffixes)
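The root-plus-affix idea can be sketched as a toy segmenter. Everything here is hypothetical: the PREFIXES, SUFFIXES, and ROOTS lists and the segment function exist only for this example, and real morphological analysis has to handle spelling changes that this sketch ignores.

```python
# Hypothetical, deliberately tiny lists for illustration only.
PREFIXES = ("re", "un")
SUFFIXES = ("ing", "ed", "er")
ROOTS = {"run", "talk"}


def segment(word):
    """Split word into (prefix, root, suffix), or return None if no split fits."""
    for prefix in ("",) + PREFIXES:
        for suffix in ("",) + SUFFIXES:
            if not word.startswith(prefix) or not word.endswith(suffix):
                continue
            # Whatever is left between the prefix and the suffix must be a known root.
            root = word[len(prefix): len(word) - len(suffix)]
            if root in ROOTS:
                return prefix, root, suffix
    return None


for w in ["run", "rerun", "talking", "talked", "talker"]:
    print(w, "->", segment(w))
# run -> ('', 'run', '')
# rerun -> ('re', 'run', '')
# talking -> ('', 'talk', 'ing')
# talked -> ('', 'talk', 'ed')
# talker -> ('', 'talk', 'er')
```

A word like running returns None here, because the doubled n means the leftover "runn" is not in the root list. Handling such spelling changes is part of what makes real stemming rules more involved.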

Stemming with NLTK

We'll use NLTK’s PorterStemmer, a rule-based stemmer widely used in NLP tasks.

1. Install NLTK and Download Resources
Install NLTK first (for example, with pip install nltk). Then, in your Python environment, run the following:

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

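PorterStemmer itself needs no downloaded data. The wordnet and omw-1.4 resources are used by WordNetLemmatizer for lemmatization; we download them here because stemming and lemmatization are often used together in NLP tasks.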

2. Run Stemming Code

from nltk.stem import PorterStemmer

# PorterStemmer is purely rule-based and needs no downloaded corpora.
stemmer = PorterStemmer()

# Each call returns the stem of a single word.
print(stemmer.stem('connect'))     # connect
print(stemmer.stem('connected'))   # connect
print(stemmer.stem('connecting'))  # connect
print(stemmer.stem('connection'))  # connect
print(stemmer.stem('reconnect'))   # reconnect

Output:
connect
connect
connect
connect
reconnect

As you can see:

PorterStemmer removes suffixes like -ed, -ing, -ion.
It does not remove prefixes like re-, so reconnect stays unchanged.
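The same behavior shows up with other words. The sketch below, which assumes NLTK is installed and uses a word list picked for this example, checks a few more prefixes and also shows that a stem does not have to be a real word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Word list picked for illustration.
words = ["reconnecting", "disconnected", "unconnected", "studies", "university"]
for word in words:
    print(f"{word:13} -> {stemmer.stem(word)}")
# reconnecting  -> reconnect
# disconnected  -> disconnect
# unconnected   -> unconnect
# studies       -> studi
# university    -> univers
```

The prefixes re-, dis-, and un- all survive, so these words do not collapse to connect. The last two results, studi and univers, are not dictionary words; if you need valid words, lemmatization is usually the better fit.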
