Opinions are a valuable source of product data. They point to the improvements and corrections a product or service needs, and it is from such opinions that companies build end products that are customer-centred and comfortable to use. Sentiment analysis becomes expensive when it means manually reading and scoring every individual customer review, which is one of the main reasons automated sentiment analysis is so valuable to companies. In this article, we discuss the concepts that are crucial when getting started with sentiment analysis, and we walk through a simple roadmap for carrying out the analysis step by step.
Introduction
Sentiment analysis, also called opinion mining, is the identification and classification of sentiments expressed in text, and it sits within the field of natural language processing. Twitter is among the best places to find a sizeable amount of sentiment data to analyze. The data helps us understand how people's opinions on social media differ across various topics. To detect customer sentiment at scale, it is therefore valuable to build an automated machine learning sentiment analysis model.
Source data: Sentiment140 dataset
Machine learning pipeline: logistic regression, SVM, and Bernoulli Naive Bayes as classifiers; Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction; and the NLTK library for natural language processing.
Now let us dive into the exciting part of the analysis, the pipeline steps:
A) IMPORT LIBRARIES AND DEPENDENCIES
import numpy as np
import pandas as pd
# For plotting purposes
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
# Sklearn tools
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# For extracting important features from text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
# Natural language toolkit (used later for stemming and lemmatization)
import nltk
from nltk.stem import WordNetLemmatizer
B) LOADING DATA INTO OUR NOTEBOOK
# We shall load the data into a variable named data.
# Due to the volume of the data, we can limit the number of rows we load; in this case we use 5,000 rows
data = pd.read_csv('file_path', nrows=5000)
Some issues can arise while loading the data in a Jupyter notebook. We will handle one of them here: if, after loading the data and reading the head, we find that the first row of data is being treated as the header instead of as a data row, we can correct that by using the following command instead:
# With header=None, the column headers are returned as numbers and the first row is kept as data
data = pd.read_csv('file_path', nrows=5000, header=None)
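Since the file has no header row, pandas labels the columns 0, 1, 2, and so on, while the later steps in this article refer to columns named 'Target' and 'Tweet_Content'. The renaming below is a minimal sketch under the assumption that the sentiment label is the first column and the tweet text is the last column (as in the usual Sentiment140 layout); adjust it to match your copy of the file:
# Assumed layout: sentiment label in the first column, tweet text in the last column
data = data.rename(columns={0: 'Target', data.columns[-1]: 'Tweet_Content'})
data.head()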
C) EDA (Exploratory Data Analysis)
At this point, we want to explore and discover the different aspects and contents of the data we now have. I have written an article about this, so in case you need some clarification on what exactly EDA is, you can visit that article.
i) Read head
This lets us access the first five rows by default. We can use the following code to achieve this:
data.head()
If correctly loaded, our data has 5 columns.
ii) Shape
Basically, this is just to see the dimensions of our data in terms of the number of rows and columns. We use the following code to achieve this:
data.shape
iii) Length of data
To check out the length of the data, we use the following code:
len(data)
iv) Check for Null values
We want to check whether the data we have loaded contains any null values. To achieve this, we use the following statement:
data.isnull()
This returns a response for every row, which is a little cumbersome, so a better way is to return the sum of the null values, if any exist, using the following command:
np.sum(data.isnull())
If done correctly, this returns the number of null values in each of the five columns. We discover that this particular data does not contain any null values.
v) Unique Values
We now want to check the 'Target' column for any unique values that may be of interest to us. We use:
data['Target'].unique()
In our case, 0 is the only unique value returned, and it is of datatype int.
vi) Data Visualization
We can plot a graph in the Jupyter notebook to understand our data graphically, as sketched below.
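As a minimal sketch, assuming the sentiment labels live in the 'Target' column, a simple count plot of the labels gives a quick picture of the class balance:
# Count how many tweets carry each sentiment label and plot the result
sns.countplot(x='Target', data=data)
plt.title('Distribution of sentiment labels')
plt.show()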
D) PRE-PROCESSING
Removing noise (unnecessary data) from the data will help us build more accurate models. The noise in this kind of data includes stopwords (like 'the', 'is', 'in'), special characters, and hashtags if they exist. It is also advisable to change the data to lowercase for better generalization. The steps below handle stopwords, punctuation, URLs and numbers; a small sketch for hashtags and mentions follows right after this paragraph.
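Hashtags and @mentions are not covered by the numbered steps below, so here is a minimal, hypothetical sketch of one way to strip them with a regular expression (the helper name cleaning_hashtags is my own, not part of the original pipeline):
import re

def cleaning_hashtags(text):
    # Drop '#' and '@' tokens together with the word attached to them
    return re.sub(r'[#@]\w+', ' ', text)

# Example usage once the tweet column exists (same column names as in the steps below):
# new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(cleaning_hashtags)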
i) Create new dataset using Target and Content
We achieve this by selecting the columns of interest from the data. In our case we will use the 'Target' column, which holds the sentiment classification, and the 'Tweet_Content' column, and work with them:
sample = data[['Tweet_Content','Target']]
ii) Check for unique values
We can now check the unique values of the 'Target' column in our new dataset sample using the following code:
sample['Target'].unique()
iii) Divide the Target column
We achieve this by treating all rows whose Target value is 0 as negative and all rows whose Target value is 1 as positive. Because of the vastness of the full dataset, we can choose a portion from each category to work with; in our case we will take up to 25,000 tweets from each category, as shown below:
positive = sample[sample['Target'] == 1]
negative = sample[sample['Target'] == 0]
positive = positive.iloc[:int(25000)]
negative = negative.iloc[:int(25000)]
iv) Form new dataset
Now we form a new dataset that combines the positive and negative samples (up to 50,000 tweets in total) using the following code:
new_dataset = pd.concat([positive,negative])
v) Change tweet content to lower case
We want the data to be uniform. The good news is that pandas string methods make it easy to turn text into lowercase or uppercase. In our case we will convert to lowercase using the following code:
new_dataset['Tweet_Content']=new_dataset['Tweet_Content'].str.lower()
vi) Remove all stopwords
We want to get rid of all stopwords. We start by creating a list of common stopwords, which we will then use in a method that removes them from our data. We create the list as shown:
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're','s', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was', 'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
"youve", 'your', 'yours', 'yourself', 'yourselves']
Now let us create a method to remove all these stop words as shown below:
STOPWORDS = set(stopwordlist)  # turn the list into a set
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda text: cleaning_stopwords(text))
new_dataset['Tweet_Content'].head()
The above code returns the tweet content with the stopwords defined in the stopword list removed.
vii) Remove Punctuation
Punctuation marks can hinder the accuracy of our model. Luckily, Python's built-in string module provides the full set of English punctuation characters, which is very helpful right now. Hence, we will remove them in this section as shown below:
import string
eng_punctuations = string.punctuation
punctuations_list = eng_punctuations
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda text: cleaning_punctuations(text))
new_dataset['Tweet_Content'].head()
This returns the first five tweets without punctuation marks.
viii) Remove Repeating Characters
To collapse characters that repeat many times in a row (for example, 'soooo' becomes 'so'), we use a regular expression. We first import the re library, in case it is not already present. (A regular expression, or RE, specifies a set of strings that matches it; the functions in this module let you check whether a particular string matches a given regular expression.) We import it using the following code:
import re
Now let us define a method to remove the repeating characters:
def cleaning_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: cleaning_repeating_char(x))
new_dataset['Tweet_Content'].tail()
ix) Remove URLs
On Twitter, tweets sometimes contain URLs that refer the audience to another location on the internet related to their interest at the time. We will define a method to remove them:
def cleaning_URLs(data):
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: cleaning_URLs(x))
new_dataset['Tweet_Content'].head()
x) Remove numeric characters
At this point, we remove numbers from the tweets because we want our data to be as clean as possible. We achieve this using the following code:
def remove_numbers(data):
    return re.sub('[0-9]+', '', data)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: remove_numbers(x))
new_dataset['Tweet_Content'].head()
xi) Tokenizing Tweet Content
Tokenization is splitting paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words). It can be achieved as in the following example:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
def tokenize_tweet(tweet):
    return tokenizer.tokenize(tweet)
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content'].apply(tokenize_tweet)
new_dataset.head(10)
xii) Lemmatization
Lemmatization in Natural Language Processing (NLP) reduces a word to its dictionary root (lemma), which helps models identify similarities between related word forms. It is done as shown below:
lm = WordNetLemmatizer()  # requires the WordNet corpus: nltk.download('wordnet')
def lemmatizer_on_text(data):
    return [lm.lemmatize(word) for word in data]
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content_Token'].apply(lambda x: lemmatizer_on_text(x))
new_dataset['Tweet_Content_Token'].head()
xiii) Stemming the words
Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. It is useful for indexing words. It is done as shown in the example below:
st = nltk.PorterStemmer()
def stemming_on_text(data):
    return [st.stem(word) for word in data]
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content_Token'].apply(lambda x: stemming_on_text(x))
new_dataset['Tweet_Content_Token'].head()
xiv) Separate input features and labels
At this point, we have to separate the input features from the labels. The tweet content is fixed; it cannot be changed once it has been expressed. What varies is the label of the tweet, being either positive or negative. We therefore assign the tweet content to X and the sentiment label to y, as shown in the example:
X = new_dataset.Tweet_Content
y = new_dataset.Target
xv) Cloud of Words
Sometimes words change meaning with a change in context. A cloud of words helps us see the most frequent words in each sentiment class, which is useful while developing the model. A cloud of words for one class, for example the positive tweets, is achieved through the following example:
positive_text = new_dataset[new_dataset['Target'] == 1]['Tweet_Content']
wc = WordCloud(max_words = 1000, width = 1600, height = 800,
               collocations=False).generate(" ".join(positive_text))
plt.figure(figsize = (20,20))
plt.imshow(wc)
plt.axis('off')
plt.show()
xvi) Split data into training and testing data
We split our data into training data and testing data as shown in the example:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=26105111)
E) CREATING MODELS
We have now processed the data to our liking, and we can fit it and create a model. We shall use the TF-IDF vectorizer to fit the data as shown in the example:
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
print('Feature_words count: ', len(vectoriser.get_feature_names_out()))  # use get_feature_names() on older scikit-learn versions
We can also use the vectorizer to transform our training and testing data as shown in the example:
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
First Model: Bernoulli Naive Bayes Classifier
We shall evaluate and report the score of our models using the following helper function:
def model_Evaluate(model):
    y_predict = model.predict(X_test)
    print(classification_report(y_test, y_predict))
    # Compute and plot the confusion matrix
    cf_matrix = confusion_matrix(y_test, y_predict)
    categories = ['Negative', 'Positive']
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues', fmt = '',
                xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values", fontdict = {'size':14}, labelpad = 10)
    plt.title("Confusion Matrix", fontdict = {'size':18}, pad = 20)
    plt.show()
Now to the model:
BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_predict1 = BNBmodel.predict(X_test)
Second Model: Logistic Regression
We also fit the second model and check its accuracy as shown in the example:
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)
Third Model: SVM
The last model is the SVM; afterwards we will compare the three models and choose which one is preferred:
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)
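As a minimal sketch of the comparison, assuming the predictions y_predict1, y_pred3, and y_pred2 produced by the three models above, we can put the accuracy scores side by side:
from sklearn.metrics import accuracy_score

# Compare the accuracy of the three fitted models on the same test set
print('Bernoulli Naive Bayes accuracy:', accuracy_score(y_test, y_predict1))
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred3))
print('Linear SVC accuracy:', accuracy_score(y_test, y_pred2))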
CONCLUSION
From the comparison of the scores, logistic regression has the highest score among the three models. You can find the code in a Jupyter notebook on my GitHub account for further reference: https://github.com/Gamalie/Data_Science-BootCamp/blob/main/Twitter%20Sentiment%201.ipynb
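As a final, minimal sketch (assuming the fitted vectoriser and LRmodel from the steps above, and the article's label mapping of 0 for negative and 1 for positive), classifying a new tweet only requires transforming it with the same vectorizer before calling predict:
# Hypothetical example tweet, not taken from the dataset
new_tweet = ["i really love how this phone handles photos"]
new_features = vectoriser.transform(new_tweet)
print('Predicted sentiment label:', LRmodel.predict(new_features)[0])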