DEV Community

datatoinfinity

Stop Words Removal - NLP

Stop words are common words, like "a," "the," and "is," that are often removed from text before analysis because they carry little meaning on their own.

Examples of stop words:

  • Articles: a, an, the
  • Conjunctions: and, but, or
  • Prepositions: in, on, at, with
  • Pronouns: he, she, it, they
  • Common verbs: is, am, are, was, were, be, being, been

Sentence with stop words:

This is a book that is about the history of the world in a very detailed and interesting way.

Total words: 19

Stop words:
this, is, a, that, is, about, the, of, the, in, a, and

Number of stop words: 12

Sentence After Removing Stop Words:

Book history world very detailed interesting way.

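Total number of words: 7

Note: the NLTK English list used later in this post also treats "very" as a stop word, so the NLTK output keeps six words instead of seven.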
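The removal above can be sketched in plain Python before bringing in any library. This is only an illustration: the `STOP_WORDS` set is a hypothetical hand-written list that mirrors the stop words named above, and the regular expression is an assumed shortcut for splitting the sentence into words.

```python
import re

# Hypothetical hand-written stop word set, mirroring the words listed above.
STOP_WORDS = {"this", "is", "a", "that", "about", "the", "of", "in", "and"}

sentence = ("This is a book that is about the history of the world "
            "in a very detailed and interesting way.")

# Keep runs of letters only, so the trailing full stop is dropped.
words = re.findall(r"[A-Za-z]+", sentence)

# Compare in lower case so that "This" matches "this".
kept = [w for w in words if w.lower() not in STOP_WORDS]

print(kept)       # ['book', 'history', 'world', 'very', 'detailed', 'interesting', 'way']
print(len(kept))  # 7
```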

Explanation:

  1. We do not want words that take up space in a database or use valuable processing time without adding meaning.
  2. You can see a similar idea at work when you search on Google or in ChatGPT.
  3. A search engine does not need perfect English; only the keywords matter to it.

Removing StopWords using nltk

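1. First, download the NLTK stop word corpus. This assumes NLTK itself is already installed, for example with pip install nltk.

import nltk
nltk.download('stopwords')
# word_tokenize (step 3) also needs tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')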

2. Now, in your code editor, write:

import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

Remember to pass 'english' as the language. If you call stopwords.words() with no argument, you get the stop words of every language in the corpus.

3. Now we will take a text and tokenize that text into words.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_word=stopwords.words('english')
txt='This is a book that is about the history of the world in a very detailed and interesting way.'
txt=word_tokenize(txt)
print(txt)
Output:
['This', 'is', 'a', 'book', 'that', 'is', 'about', 'the', 'history', 'of', 'the', 'world', 'in', 'a', 'very', 'detailed', 'and', 'interesting', 'way', '.']
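For comparison, here is what plain whitespace splitting does with the same sentence. This is only a side-by-side sketch using Python's built-in str.split(), not part of the NLTK workflow; `naive_tokens` is an illustrative name.

```python
txt = 'This is a book that is about the history of the world in a very detailed and interesting way.'

# str.split() only cuts on whitespace, so punctuation stays attached to the word.
naive_tokens = txt.split()

print(naive_tokens[-1])  # way.
```

In the word_tokenize output above, 'way' and '.' are separate tokens, which makes it possible to match and filter them one by one.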

4. Now we loop over the tokens one by one and check whether each is a stop word.

txt='This is a book that is about the history of the world in a very detailed and interesting way.'
txt=word_tokenize(txt)
for word in txt:
    print(word,word in stop_word)
Output:
This False
is True
a True
book False
that True
is True
about True
the True
history False
of True
the True
world False
in True
a True
very True
detailed False
and True
interesting False
way False
. False

Every True value marks a stop word and every False value marks a word that is not one. But here is the catch: the first token, 'This', is reported as False. If you run print(stopwords.words('english')), you will see that 'this' is in the list, but in lower case.

So remember to convert each word to lower case before checking it against the stop word list, otherwise you will get wrong answers. In Python, 'This' and 'this' are different strings because string comparison is case sensitive. To get it right:

txt='This is a book that is about the history of the world in a very detailed and interesting way.'
txt=word_tokenize(txt)
for word in txt:
    print(word,word.lower() in stop_word)
Output:
This True
is True
a True
book False
that True
is True
about True
the True
history False
of True
the True
world False
in True
a True
very True
detailed False
and True
interesting False
way False
. False
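The case issue can be reproduced in isolation with a tiny example. The three-word `stop_words` set here is a hypothetical stand-in for the NLTK list, chosen only to keep the sketch self-contained.

```python
# Hypothetical three-word stop list, all lower case like the entry 'this' seen earlier.
stop_words = {"this", "is", "a"}

token = "This"

print(token == "this")              # False: string comparison is case sensitive
print(token in stop_words)          # False: 'This' is not in the lower-case list
print(token.lower() in stop_words)  # True: normalise the token before the lookup
```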

5. Removing the Stop Word:

for word in txt:
    if word.lower() not in stop_word:
        print(word)

The condition says: if the word is not a stop word, print it.

Output:
book
history
world
detailed
interesting
way
.

Notice that the full stop is also in the output, because it is not a stop word. We want to remove it too, so we add another condition: print the word only if its length is at least 2.

for word in txt:
    if word.lower() not in stop_word and len(word) >= 2:
        print(word)
Output:
book
history
world
detailed
interesting
way
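The steps so far can be wrapped in one reusable function that returns a list instead of printing. This is a sketch under a few assumptions: `remove_stop_words` and `sample_stop_words` are hypothetical names, the sample list is a small subset standing in for stopwords.words('english'), and str.isalpha() is swapped in for the length check because a length test would keep a multi-character punctuation token such as '...' and drop any one-letter word that is not a stop word.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are alphabetic and not stop words."""
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_set]


# Tokens copied from the word_tokenize output in step 3.
tokens = ['This', 'is', 'a', 'book', 'that', 'is', 'about', 'the', 'history',
          'of', 'the', 'world', 'in', 'a', 'very', 'detailed', 'and',
          'interesting', 'way', '.']

# Hypothetical subset of the English list, just large enough for this sentence.
sample_stop_words = ['this', 'is', 'a', 'that', 'about', 'the', 'of', 'in',
                     'very', 'and']

filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)
# ['book', 'history', 'world', 'detailed', 'interesting', 'way']
```

With NLTK set up as in the earlier steps, the same function can be called as remove_stop_words(word_tokenize(txt), stopwords.words('english')). Note that isalpha() also drops tokens containing digits, hyphens, or apostrophes, so it is a trade-off rather than a universal rule.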
