Sunil Aleti

Posted on Jul 26, 2020

Don't blindly remove STOPWORDS for a Sentiment Analysis Model

#machinelearning #datascience #tutorial

Does removing stopwords really improve model performance?

Hey Peeps!!
Before creating any model, data preprocessing is a must.
Data preprocessing includes data cleaning, data transformation and data reduction.

Data Cleaning:

It involves handling missing data, noisy data, etc.

Missing Data
Noisy Data

Data Transformation:

This step transforms the data into forms suitable for the mining process.

Normalization
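As a small illustration of normalization, here is a minimal sketch of min-max scaling, which rescales values into the range 0 to 1. Min-max is only one common form of normalization and is my assumption here; the function name `min_max_normalize` is made up for this example.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 40])
print(scaled)
```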

Data Reduction:

When working with a huge volume of data, analysis becomes harder. To deal with this, we use data reduction techniques.

Dimensionality Reduction

I started working on the Amazon Fine Food Reviews dataset, which I got from Kaggle. The main objective of this model is to determine whether a review is positive or negative, so I began with data preprocessing before training on the dataset.

Steps in preprocessing:

Begin by removing the HTML tags.
Remove punctuation and a limited set of special characters such as , or .
Check that the word is made up of English letters only and is not alphanumeric.
Check that the word is longer than 2 letters (on the assumption that no adjective has only two letters).
Convert the word to lowercase.
Remove stopwords.
Finally, apply Snowball stemming to the word.
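The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code I ran: the tiny `STOP_WORDS` set, the regular expressions and the `preprocess` function name are assumptions made so the example runs without any downloads, and the stemmer is passed in as a plain function that does nothing by default.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# a real pipeline would load a fuller list such as NLTK's.
STOP_WORDS = {"the", "this", "and", "was", "very", "for"}

TAG_RE = re.compile(r"<[^>]+>")           # step 1: HTML tags
PUNCT_RE = re.compile(r"[^A-Za-z0-9\s]")  # step 2: punctuation / special characters


def preprocess(review, stop_words=STOP_WORDS, stem=lambda w: w):
    """Apply the seven steps to one review and return the cleaned tokens."""
    text = TAG_RE.sub(" ", review)   # 1. strip HTML tags
    text = PUNCT_RE.sub(" ", text)   # 2. replace punctuation with spaces
    tokens = []
    for word in text.split():
        if not word.isalpha():       # 3. skip tokens containing digits
            continue
        if len(word) <= 2:           # 4. keep only words longer than 2 letters
            continue
        word = word.lower()          # 5. lowercase
        if word in stop_words:       # 6. drop stopwords
            continue
        tokens.append(stem(word))    # 7. stem (identity function by default)
    return tokens


print(preprocess("<p>The product is REALLY good, 10/10!</p>"))
# ['product', 'really', 'good']
```

To get actual Snowball stemming, pass `stem=nltk.stem.SnowballStemmer('english').stem` when calling `preprocess`.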

I then trained a Naive Bayes model on the dataset and tested it. Unfortunately, the model underperformed.
After reviewing the model, I found that the cause was the removal of stopwords. Yes, you heard it right.

The most common method to remove stopwords is using NLTK's stopwords.

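import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus
stop_words = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')  # stemmer for the final preprocessing step
print(stop_words)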

The main objective of this model is to determine whether a given review is positive or negative. Stopword removal, however, also strips out negation words such as "not", which can flip the meaning of a review from negative to positive.
For example:

Before stopword removal	After stopword removal
The product is really very good (Positive)	product really good (Positive)
The products seems to be good. (Positive)	products seems good (Positive)
Good product I really liked it (Positive)	Good product really liked (Positive)
I didn’t like the product (Negative)	like product (Positive)
The product is not good (Negative)	product good (Positive)

We can see that after stopword removal, the negative reviews have also turned positive.

A bit scary right?
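The effect can be reproduced with a few lines of Python. This is a small sketch under assumptions: the `STOP_WORDS` set below is a hand-picked subset, hard-coded so the example runs without downloads (each of these words also appears in NLTK's English list), and `remove_stopwords` is a helper written just for this demo.

```python
# Hand-picked subset of an English stopword list (an assumption for this demo;
# NLTK's real list is much longer).
STOP_WORDS = {"the", "is", "i", "it", "to", "be", "very", "not", "didn't"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace and drop every stopword."""
    # Normalise curly apostrophes so "didn’t" matches "didn't".
    words = text.lower().replace("’", "'").split()
    return " ".join(w for w in words if w not in stop_words)


reviews = [
    "The product is really very good",  # positive
    "I didn't like the product",        # negative
    "The product is not good",          # negative
]
for review in reviews:
    print(f"{review!r} -> {remove_stopwords(review)!r}")
```

Both negative reviews lose their negation: the second becomes "like product" and the third becomes "product good".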

If you are working with basic NLP techniques like BOW, W2V or TF-IDF (Term Frequency-Inverse Document Frequency), removing stopwords is still a good idea, because stopwords act like noise for these methods. Rather than applying the default list blindly, though, it is better to create your own stopword list that keeps negation words, or to import NLP from nlppreprocess:

from nlppreprocess import NLP
import pandas as pd

nlp = NLP()
df = pd.read_csv('some_file.csv')
df['text'] = df['text'].apply(nlp.process)
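If you would rather build your own list, one option is to start from an existing stopword set and take the negation words back out. The sketch below is only an assumption of how that might look: `NEGATIONS`, `keep_negations` and the small `base` set are made up for illustration, and `base` stands in for a full list such as NLTK's.

```python
# Negation words we want the model to see (illustrative, not exhaustive).
NEGATIONS = {"no", "nor", "not", "never", "none", "neither", "cannot"}


def keep_negations(base_stop_words):
    """Return a copy of the stopword set without negation words or n't contractions."""
    return {
        w for w in base_stop_words
        if w not in NEGATIONS and not w.endswith("n't")
    }


def remove_stopwords(text, stop_words):
    """Lowercase the text, split on whitespace and drop every stopword."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)


# Small stand-in for a real stopword list (an assumption for this sketch).
base = {"the", "is", "i", "very", "not", "no", "didn't", "don't"}
sentiment_stop_words = keep_negations(base)

print(remove_stopwords("The product is not good", sentiment_stop_words))
# product not good
print(remove_stopwords("I didn't like the product", sentiment_stop_words))
# didn't like product
```

With NLTK installed, the same function can be applied to `set(stopwords.words('english'))` instead of the small `base` set.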

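import re

def decontracted(phrase):
    # Expand common English contractions so negations survive as the word "not".
    phrase = phrase.replace("’", "'")  # normalise curly apostrophes
    phrase = re.sub(r"won't", "will not", phrase)  # irregular forms first
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)  # didn't -> did not
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)  # note: also rewrites possessives
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase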

So it seems reasonable to use this package for stopword removal and other preprocessing.
Let me know your opinion on this in the comment section.

Hope it's useful
A ❤️ would be Awesome 😊
