Stopwords are common words, like "a," "the," and "is," that are often removed from text before analysis because they don't carry much meaning on their own.
Examples of Stopwords:
- Articles: a, an, the
- Conjunctions: and, but, or
- Prepositions: in, on, at, with
- Pronouns: he, she, it, they
- Common verbs: is, am, are, was, were, be, being, been
Sentence with stop words:
This is a book that is about the history of the world in a very detailed and interesting way.
Total words: 19
Stop words: This, is, a, that, is, about, the, of, the, in, a, and
Number of stop words: 12 (counting repeats)
Sentence After Removing Stop Words:
Book history world very detailed interesting way.
Total number of words: 7
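A quick way to check a count like this is a few lines of plain Python. This is only a sketch: the `STOP_WORDS` set is a small hand-written list assumed for this one sentence (it is not NLTK's list), and splitting on spaces is a stand-in for a real tokenizer.

```python
# Hand-written stop word set for illustration only (an assumption, not NLTK's list).
STOP_WORDS = {"this", "is", "a", "that", "about", "the", "of", "in", "and"}

sentence = ("This is a book that is about the history of the world "
            "in a very detailed and interesting way.")

# Drop the final full stop and split on whitespace to get the words.
words = sentence.rstrip(".").split()

# Compare in lower case so that "This" matches "this".
stop = [w for w in words if w.lower() in STOP_WORDS]
kept = [w for w in words if w.lower() not in STOP_WORDS]

print("Total words:", len(words))
print("Stop words:", len(stop))
print("Remaining:", " ".join(kept))
```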
Explanation:
- We do not want words that take up space in the database or use up valuable processing time without adding meaning.
- You can see this in practice when you search on Google or type a prompt into ChatGPT.
- A search engine gains little from your perfect English. Mostly the keywords matter to it.
Removing Stop Words Using NLTK
1. Download the stop word package (NLTK itself must be installed first, for example with pip install nltk):
    import nltk
    nltk.download('stopwords')
2. Now, in your code editor, write:
    import nltk
    from nltk.corpus import stopwords

    # Print NLTK's English stop word list
    print(stopwords.words('english'))
Remember to pass 'english' as the language. If you call stopwords.words() with no argument, you get the stop words of every language.
3. Now we take a text and tokenize it into words. word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, download it with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_word = stopwords.words('english')
    txt = 'This is a book that is about the history of the world in a very detailed and interesting way.'
    txt = word_tokenize(txt)  # split the sentence into a list of tokens
    print(txt)
Output: ['This', 'is', 'a', 'book', 'that', 'is', 'about', 'the', 'history', 'of', 'the', 'world', 'in', 'a', 'very', 'detailed', 'and', 'interesting', 'way', '.']
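To get a feel for what tokenizing does, here is a rough standard-library stand-in. It is not how `word_tokenize` works internally, and it will differ on contractions, abbreviations and other harder cases. The `simple_tokenize` helper is a hypothetical name used only for this sketch.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither
    # a word character nor whitespace (for example the final full stop).
    return re.findall(r"\w+|[^\w\s]", text)

txt = ("This is a book that is about the history of the world "
       "in a very detailed and interesting way.")
tokens = simple_tokenize(txt)
print(tokens)
```

For a simple sentence like this one, the full stop comes out as its own token, which is why it has to be handled separately later on.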
4. Now we loop over the tokens one by one and check whether each one is a stop word.
    txt = 'This is a book that is about the history of the world in a very detailed and interesting way.'
    txt = word_tokenize(txt)
    for word in txt:
        # prints the token and True/False for "is it in the stop word list?"
        print(word, word in stop_word)
Output:
    This False
    is True
    a True
    book False
    that True
    is True
    about True
    the True
    history False
    of True
    the True
    world False
    in True
    a True
    very True
    detailed False
    and True
    interesting False
    way False
    . False
Every True means the token is a stop word, and every False means it is not. But here is the catch: if you run print(stopwords.words('english')), you will see that 'this' is a stop word, but only in lower case.
So remember to convert each word to lower case before checking it, otherwise you will get the wrong answer. To Python, 'This' and 'this' are different strings because string comparison is case sensitive. Here is the corrected version:
    txt = 'This is a book that is about the history of the world in a very detailed and interesting way.'
    txt = word_tokenize(txt)
    for word in txt:
        # lower-case the token before the lookup so 'This' matches 'this'
        print(word, word.lower() in stop_word)
Output:
    This True
    is True
    a True
    book False
    that True
    is True
    about True
    the True
    history False
    of True
    the True
    world False
    in True
    a True
    very True
    detailed False
    and True
    interesting False
    way False
    . False
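The case problem is easy to reproduce in isolation. In this minimal sketch the three-word `stop_word` set is a stand-in assumed for illustration, all lower case like the entries NLTK returns:

```python
# A tiny stand-in stop word list, all lower case (assumed for this sketch).
stop_word = {"this", "is", "a"}

word = "This"
exact_match = word in stop_word            # 'This' is not equal to 'this'
lower_match = word.lower() in stop_word    # 'this' is in the set

print(exact_match)   # False
print(lower_match)   # True
```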
5. Removing the stop words:
    for word in txt:
        if word.lower() not in stop_word:
            print(word)
The condition says: if the word is not a stop word, print it.
Output:
    book
    history
    world
    detailed
    interesting
    way
    .
Note that 'very' is gone too, because NLTK's English list includes it. The full stop is still in the output because it is not a stop word. To remove it as well, we can add another condition: print the word only if its length is at least 2. Keep in mind that this also drops any other one-character token.
    for word in txt:
        # keep tokens that are not stop words and have at least 2 characters
        if word.lower() not in stop_word and len(word) >= 2:
            print(word)
Output:
    book
    history
    world
    detailed
    interesting
    way
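The loop prints the words one by one, but in practice you usually want the result back as a list. The sketch below wraps the same idea in a function. `remove_stop_words` is a hypothetical helper, and the short `demo_stop_words` list stands in for `stopwords.words('english')`. The punctuation test (keep a token only if it contains a letter or digit) is a swapped-in alternative to the length check, chosen so that short real tokens such as "5" or "x" are not thrown away.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are neither stop words nor pure punctuation."""
    stop_set = {w.lower() for w in stop_words}  # a set gives fast membership tests
    return [
        t for t in tokens
        if t.lower() not in stop_set
        and any(ch.isalnum() for ch in t)  # drops '.', ',' and similar tokens
    ]

# Stand-in for stopwords.words('english'); an assumption for this sketch.
demo_stop_words = ["this", "is", "a", "that", "about", "the", "of",
                   "in", "very", "and"]

tokens = ['This', 'is', 'a', 'book', 'that', 'is', 'about', 'the', 'history',
          'of', 'the', 'world', 'in', 'a', 'very', 'detailed', 'and',
          'interesting', 'way', '.']

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
print(" ".join(filtered))
```

With NLTK installed, passing `stopwords.words('english')` as the second argument should give the same result for this sentence.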