Stemming is the process of reducing a word to its root or base form by chopping off suffixes or prefixes, usually based on heuristic rules. It does not take context into account or ensure that the result is a valid word.
For example, in search engines or digital libraries, a user might search for "investing," but expect results for all related forms like "invest," "investment," or "invested." Stemming helps retrieve such related results by reducing these variations to a common stem.
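To make the search scenario concrete, here is a minimal sketch of a suffix-stripping stemmer used to match a query against documents. It is a toy, not the Porter algorithm: the `SUFFIXES` rule list and the `toy_stem` and `matches` helpers are hypothetical names invented for this illustration.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
SUFFIXES = ("ment", "ing", "ed", "s")  # hypothetical rule list, checked in order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, documents):
    """Return the documents containing a word that shares the query's stem."""
    target = toy_stem(query)
    return [
        doc for doc in documents
        if any(toy_stem(token) == target for token in doc.split())
    ]


docs = ["invest early", "a sound investment", "she invested wisely", "weather report"]
print(matches("investing", docs))
# ['invest early', 'a sound investment', 'she invested wisely']
```

Because the rules ignore context and never check a dictionary, the same function turns "studies" into "studie", which is not a valid word. A stem only has to be consistent across related forms to be useful for matching.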
Inflected Words
Inflected words change form to express tense, number, case, or gender.
Examples:
run, runs, running, ran
All are derived from the same root but used in different grammatical contexts.
Morphology
Morphology is a branch of linguistics that studies how words are formed from smaller units called morphemes.
Root Morpheme: Carries the core meaning.
Examples: run, talk
Affixes: Modify the meaning of the root.
Examples: re-, -ing, -ed, -er
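As a rough illustration of roots and affixes, the sketch below splits a word into a prefix, a root, and a suffix using only the affixes listed above. The `split_affixes` function and its `min_root` parameter are assumptions made for this example; real morphological analysis is considerably more involved.

```python
# Naive affix splitter built from the example affixes above (re-, -ing, -ed, -er).
PREFIXES = ("re",)
SUFFIXES = ("ing", "ed", "er")


def split_affixes(word, min_root=3):
    """Return (prefix, root, suffix); min_root is a hypothetical safety limit."""
    prefix = suffix = ""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_root:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_root:
            suffix, word = s, word[: -len(s)]
            break
    return prefix, word, suffix


print(split_affixes("retalked"))  # ('re', 'talk', 'ed')
print(split_affixes("talking"))   # ('', 'talk', 'ing')
print(split_affixes("reader"))    # ('re', 'ader', '') -- a wrong analysis
```

The last call shows why pure string matching is not morphology: "reader" is "read" plus "-er", but the splitter sees "re-" first and produces a root that does not exist.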
Stemming with NLTK
We'll use NLTK’s PorterStemmer, a rule-based stemmer widely used in NLP tasks.
1. Install NLTK and Download Resources
Install NLTK with pip install nltk, then run the following in Python:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet is used for lemmatization rather than stemming. PorterStemmer itself needs no downloaded data, but we download wordnet here because the two are often used together in NLP tasks.
2. Run Stemming Code
from nltk.stem import PorterStemmer, WordNetLemmatizer

stem = PorterStemmer()
lem = WordNetLemmatizer()

print(stem.stem('connect'))     # connect
print(stem.stem('connected'))   # connect
print(stem.stem('connecting'))  # connect
print(stem.stem('connection'))  # connect
print(stem.stem('reconnect'))   # reconnect
Output:
connect
connect
connect
connect
reconnect
As you can see:
PorterStemmer removes suffixes like -ed, -ing, -ion.
It does not remove prefixes like re-, so reconnect stays unchanged.
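The earlier point that a stem need not be a valid word can be checked with the same stemmer. This is a small sketch that assumes NLTK is installed; the variable name `stemmer` and the word choices are just for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The stem is a truncated form, not necessarily a dictionary word.
print(stemmer.stem('studies'))     # studi

# Two different words can collapse to the same stem.
print(stemmer.stem('university'))  # univers
print(stemmer.stem('universe'))    # univers
```

Whether this kind of merging helps or hurts depends on the task: it improves recall in search, but it also groups words a reader would consider unrelated.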