Sunil Aleti

Don't blindly remove STOPWORDS for a Sentiment Analysis Model

Does removing stopwords really improve model performance?

Hey Peeps!!
Before creating any model, data preprocessing is a must.
Data preprocessing includes Data Cleaning, Data Transformation, and Data Reduction.

Data Cleaning:

It involves handling missing data, noisy data, etc.
  • Missing Data
  • Noisy Data
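
For example, a minimal pandas sketch of both (the column names here are made up for illustration, not from any specific dataset):

import pandas as pd

# toy data: one review has missing text, one has an out-of-range (noisy) score
df = pd.DataFrame({'Text': ['Great taste', None, 'Terrible'],
                   'Score': [5, 3, 99]})

df = df.dropna(subset=['Text'])        # missing data: drop rows with no review text
df = df[df['Score'].between(1, 5)]     # noisy data: keep only valid 1-5 ratings
print(df)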

Data Transformation:

This step transforms the data into forms suitable for the mining process.
  • Normalization
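
For instance, a quick min-max normalization on a toy column (again, the names are made up):

import pandas as pd

df = pd.DataFrame({'Score': [1, 3, 5]})   # toy column, purely for illustration

# min-max normalization: rescale values to the [0, 1] range
df['Score_norm'] = (df['Score'] - df['Score'].min()) / (df['Score'].max() - df['Score'].min())
print(df)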

Data Reduction:

When working with a huge volume of data, analysis becomes harder. To deal with this, we use data reduction techniques.
  • Dimensionality Reduction
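
PCA is one common way to do this; a rough sketch with scikit-learn on dummy data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)                          # 100 samples, 50 features (dummy data)
X_reduced = PCA(n_components=10).fit_transform(X)    # keep the 10 strongest components
print(X_reduced.shape)                               # (100, 10)
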
I started working on the Amazon Fine Food Reviews dataset, which I got from Kaggle. The main objective of this model is to determine whether a review is positive or negative, so I started with data preprocessing before training.

Steps in preprocessing:

  • Begin by removing the HTML tags.
  • Remove any punctuation and special characters like , or . etc.
  • Check that the word is made up of English letters and is not alphanumeric
  • Check that the length of the word is greater than 2 (as it was found that there are no 2-letter adjectives)
  • Convert the word to lowercase
  • Remove stopwords
  • Finally, apply Snowball stemming to the word
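
Put together, the steps look roughly like this (a sketch of the pipeline, not the exact script I used):

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')

def clean_review(text):
    text = re.sub(r'<[^>]+>', ' ', text)               # remove HTML tags
    text = re.sub(r'[^A-Za-z ]', ' ', text)            # remove punctuation / special characters
    words = [w.lower() for w in text.split()
             if w.isalpha() and len(w) > 2]            # English letters only, longer than 2 characters
    words = [w for w in words if w not in stop_words]  # remove stopwords (the step that backfired)
    return ' '.join(sno.stem(w) for w in words)        # Snowball-stem whatever is left

print(clean_review("<br/>The product is not good!"))   # -> 'product good'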

I used the Naive Bayes algorithm to train on my dataset and tested it, but unfortunately the model underperformed.
After reviewing the model, I realized it was because of removing stopwords. Yes, you heard that right.
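
For context, the training setup was along these lines (a simplified sketch with scikit-learn and toy data; the real run used the full cleaned dataset with a train/test split):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the cleaned reviews and their labels (1 = positive, 0 = negative)
reviews = ["product really good", "good product really liked", "like product", "product good"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(reviews)
model = MultinomialNB().fit(X, labels)
print(model.predict(vec.transform(["really liked product"])))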

The most common method to remove stopwords is using NLTK's stopwords.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')
print(stop_words)

(Output: the full NLTK English stopword list, which includes negation words like "no", "not", and "nor".)

The main objective of building this model is to determine whether a given review is positive or negative, but removing stopwords also strips out the negative words, which literally changes the whole meaning of the review, i.e., from negative to positive.
Ex:

Before stopword removal → After stopword removal
The product is really very good (Positive) → product really good (Positive)
The products seems to be good (Positive) → products seems good (Positive)
Good product I really liked it (Positive) → Good product really liked (Positive)
I didn't like the product (Negative) → like product (Positive)
The product is not good (Negative) → product good (Positive)

We can see that after removing stopwords, the negative reviews also read as positive.
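
You can reproduce the last row in a couple of lines (using the stop_words set from the NLTK snippet above):

review = "the product is not good"
filtered = ' '.join(w for w in review.split() if w not in stop_words)
print(filtered)   # -> 'product good'  (the negation is gone)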

A bit scary right?

If you are working with basic NLP techniques like BOW, W2V, or TF-IDF (Term Frequency and Inverse Document Frequency), then removing stopwords is a good idea because stopwords act like noise for these methods. But instead of NLTK's default list, create a new list of your own (one that keeps the negations) or import NLP from nlppreprocess.
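
One way to build that new list is to start from NLTK's list and subtract the negations you want to keep (exactly which negations to keep is your call):

from nltk.corpus import stopwords

negations = {'no', 'not', 'nor', "don't", "didn't", "isn't", "wasn't", "won't", "shouldn't"}
custom_stop_words = set(stopwords.words('english')) - negations

review = "the product is not good"
print(' '.join(w for w in review.split() if w not in custom_stop_words))
# -> 'product not good'  (negation preserved)

And here is the nlppreprocess route: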

from nlppreprocess import NLP
import pandas as pd

nlp = NLP()
df = pd.read_csv('some_file.csv')
df['text'] = df['text'].apply(nlp.process)
or handle it yourself by expanding contractions first, so the negations survive as separate words:
import re

def decontracted(phrase):
    # expand contractions so negations stay as separate words
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

Now it seems reasonable to use one of these approaches for stopword removal and the rest of the preprocessing.
Let me know your opinion on this in the comments section.

Hope it's useful
A ❤️ would be Awesome 😊
