Karen Ngala

Getting started with Sentiment Analysis

What is Sentiment Analysis?

Humans communicate with each other using natural language, which is often complicated. We tend to use subtle variations in our speech, such as sarcasm, which are easy for us to interpret but difficult for machines. To make computers understand natural language, we use a process known as Natural Language Processing (NLP).

Sentiment analysis, also known as opinion mining, is an approach to natural language processing that seeks to identify the emotion behind a text, such as a movie or product review. Businesses around the world use sentiment analysis to understand public opinion about their products or services as expressed on online platforms.

Sentiment analysis identifies, classifies, and quantifies the sentiment expressed in a text. For example, the text "I loved the movie" carries a positive sentiment, while "I found it rather slow and boring" carries a negative sentiment. Sentiment can also be quantified: the text "I really enjoyed the movie" can be scored as relatively more positive. The amount of positivity or negativity in a text is known as its polarity.

When a large amount of data is involved, it becomes more effective to use an algorithm to determine customer satisfaction than to rely on human reviewers.

Sentiment Analysis Process

1. Import relevant libraries

There are a number of libraries we can use in sentiment analysis, depending on our goals.

  • Pandas — for data analysis and manipulation: import pandas as pd
  • Matplotlib — for data visualization: import matplotlib.pyplot as plt
  • Seaborn — for high-level data visualization: import seaborn as sns
  • WordCloud — to visualize text data; the more often a word appears in the text, the larger its font: from wordcloud import WordCloud
  • re — for string pre-processing; formats strings according to a given regular expression: import re
  • nltk — the Natural Language Toolkit, a collection of libraries used in Natural Language Processing: import nltk
  • stopwords — a collection of words that do not carry sentiment in a sentence, such as "the" and "and": from nltk.corpus import stopwords
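
Putting these together in one place (NumPy is included because later snippets use np, and the stopword list needs a one-time download):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus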

Evaluation Libraries:

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix

Once we have trained our model, we need to evaluate its correctness using the testing dataset, i.e., is the result what we expect it to be?

  • Accuracy Score — the ratio of correctly classified instances to the total number of instances.
  • Precision Score — the ratio of correctly predicted positive instances to all instances predicted as positive.
  • Recall Score — the ratio of correctly predicted positive instances to all actual positive instances.
  • Classification Report — a report of the accuracy, precision, and recall scores.
  • ROC Curve — a graph of the True Positive Rate/Sensitivity (y-axis) against the False Positive Rate, i.e. 1 - Specificity (x-axis), at various threshold values. An ROC (Receiver Operating Characteristic) curve summarizes the performance of a binary classification model.

A binary classification model is one that classifies an instance as either one thing or the other, i.e., the output can only be one of two values: 'Sick' or 'Not Sick', 'Cat' or 'Dog', 'Tree' or 'Not Tree'.

2. Load the dataset

A sample sentiment analysis dataset will contain a text column and its corresponding sentiment/target value.

To read the dataset, we need to load it using pandas:

df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

3. Exploratory Data Analysis

Understand the data you are working with. Check various aspects of the dataset to familiarize yourself with it. This will help you know how you can manipulate the dataset.

df.shape     # dimensions (rows, columns) of the dataset

df.head()    # preview the first five rows

df.dtypes    # data type of each column

# Check for rows containing null values (NumPy: import numpy as np)
np.sum(df.isnull().any(axis=1))

Distribution of target variables:

The next step is to check the various target sentiments in the dataset.

df['label'].value_counts()
# or, as a plot
sns.countplot(x='label', data=df)
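Since we imported WordCloud earlier, we can also get a quick visual feel for the vocabulary. A minimal sketch, assuming the text column is named text:

all_words = " ".join(df['text'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_words)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()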

In cases where there are more than two types of labels, we can merge them into two simple sentiments, positive and negative, represented in numerical form as '1' and '0'. A sketch of this follows below.
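
For example, a hypothetical five-star rating column could be collapsed into binary sentiment (the column name and threshold here are illustrative; adjust them to your dataset):

# Treat ratings above 3 as positive (1) and the rest as negative (0)
df['label'] = df['rating'].apply(lambda rating: 1 if rating > 3 else 0)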

4. Data Preparation

Dealing with alphanumeric text requires pre-processing to remove any odd characters and prepare the text for the model.

  1. Convert the text to lowercase. Because of case sensitivity, the word "Hello" is different from "hello"

    df['text']=df['text'].str.lower()
    
  2. Remove any stopwords. Words such as "the" and "and" do not offer much value in sentiment analysis

    from nltk.corpus import stopwords
    
    # Preview the English stopword list
    ", ".join(stopwords.words('english'))
    
    # Get rid of any stopwords
    STOPWORDS = set(stopwords.words('english'))
    def cleaning_stopwords(text):
        return " ".join([word for word in str(text).split() if word not in STOPWORDS])
    df['text'] = df['text'].apply(lambda text: cleaning_stopwords(text))
    df['text'].head()
    
  3. Remove non-alphabetic characters.

    # remove special characters, numbers and punctuation
    df['text'] = df['text'].str.replace("[^a-zA-Z#]", " ", regex=True)
    df.head()
    
    # remove short words
    df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split() if len(w)>2]))
    df.head()
    

    Depending on the data you are dealing with, you may need to remove different characters and character combinations. For example, when handling Twitter data, you will need to remove user handles, e.g. "@username"

    # function to remove a given pattern from the input text
    def remove_pattern(input_txt, pattern):
        matches = re.findall(pattern, input_txt)
        for match in matches:
            input_txt = re.sub(re.escape(match), "", input_txt)
        return input_txt
    
    # remove twitter handles (@user); np.vectorize applies the function row-wise
    df['text'] = np.vectorize(remove_pattern)(df['text'], r"@[\w]*")
    df.head()
    

  4. Tokenization

    Tokenization splits text into smaller units that can be more easily assigned meaning. For example, the string "Loved the ambiance and drinks" is broken into individual parts that the program can understand better: 'Loved', 'the', 'ambiance', 'and', 'drinks'

    This step also lays the groundwork for stemming or lemmatization.

    from nltk.tokenize import RegexpTokenizer
    
    tokenizer = RegexpTokenizer(r'\w+')
    df['text'] = df['text'].apply(tokenizer.tokenize)
    df['text'].head()

  5. Lemmatization

    Lemmatization is the process of deriving the root word from the different forms of a word. For example, the words eats and eating are part of the same lexeme, with eat as the lemma.

    Lemmatization is computationally expensive since it involves look-up tables. Unlike stemming, which simply truncates words, lemmatization considers a language's vocabulary to derive the base word. Base words produced by stemming don't always make sense: for example, the word 'having' may reduce to 'hav' under an aggressive stemmer, while a lemmatizer returns 'have'.
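
    A quick way to see the difference for yourself, using NLTK's PorterStemmer for the stemming side ('studies' is just an illustrative word):

    from nltk.stem import PorterStemmer, WordNetLemmatizer
    
    ps = PorterStemmer()
    lm = WordNetLemmatizer()
    print(ps.stem('studies'))       # 'studi' - stems need not be real words
    print(lm.lemmatize('studies'))  # 'study' - lemmas are dictionary words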

    # WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
    lm = nltk.WordNetLemmatizer()
    def lemmatizer_on_text(data):
        # lemmatize each token and return the transformed list
        return [lm.lemmatize(word) for word in data]
    
    df['text'] = df['text'].apply(lambda x: lemmatizer_on_text(x))
    
    df['text'].head()

5. Prepare for training

The next step is to separate the dataset into training data and testing data. Sentiment analysis is a classification problem, so a classification model is trained using the training dataset and evaluated using the testing dataset. The ratio of training data to testing data is usually around 4:1 (an 80:20 split) to avoid biasing the model.

The purpose of this step is to ensure the data you use to evaluate your model is unseen/new data. Evaluating a model on its own training data hides the fact that it may only perform well on that data and not on any other. A model that memorizes its training data instead of generalizing is overfitted; the opposite problem, a model too simple to capture the patterns at all, is underfitting.

The accuracy score allows us to evaluate the model's performance. We compare the training accuracy to the testing accuracy to identify underfitting and overfitting: if the training accuracy is extremely high while the testing accuracy is poor, the model is probably overfitted.

In cases where we need to choose between multiple models, we create an extra dataset known as the validation dataset, which lets us compare the models and pick the one that performs better.
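
A minimal sketch of such a three-way split, done as two passes of sklearn's train_test_split:

from sklearn.model_selection import train_test_split

# First carve off a 20% test set, then split the remainder into train and
# validation sets; 0.25 of the remaining 80% is 20% overall, giving 60/20/20
train_val, test_data = train_test_split(df, test_size=0.2, random_state=25)
train_data, val_data = train_test_split(train_val, test_size=0.25, random_state=25)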

There are many ways to split your dataset. The following is one method that utilizes sklearn.

from sklearn.model_selection import train_test_split

# This splits the data into an 80:20 ratio
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)
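The evaluation code later in this article uses the names Y_test and y_pred, which assume the features and labels were split separately. A sketch of that variant, rejoining the tokenized text into whitespace-separated strings first:

X = df['text'].apply(" ".join)   # rejoin token lists into plain strings
y = df['label']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=25)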

6. Build Model

The model you choose to use here is not set in stone. A popular choice for sentiment analysis is logistic regression, because it trains quickly even on large datasets and provides robust results. Other model choices include random forests and naive Bayes.
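
As a minimal sketch (assuming the X_train/X_test/Y_train split from the previous step), a TF-IDF plus logistic regression pipeline might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF turns each document into a weighted bag-of-words vector,
# which logistic regression then classifies
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)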

Q: What if we do not have labelled data? How can we know the sentiment in a text?
A: Using Pre-Trained Models — TextBlob

TextBlob is a library that returns the sentiment of a text as a named tuple: (polarity, subjectivity)

  • Polarity is a float in the range -1.0 to 1.0. It shows whether a text is negative or positive.
  • Subjectivity is a float in the range 0.0 to 1.0, where 0.0 is very objective and 1.0 is very subjective.
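
For example, checking a single piece of text (after installing the library with pip install textblob):

from textblob import TextBlob

blob = TextBlob("I really enjoyed the movie")
print(blob.sentiment)           # Sentiment(polarity=..., subjectivity=...)
print(blob.sentiment.polarity)  # a positive value indicates positive sentiment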

7. Model Evaluation

After training the model, we evaluate its performance. Assessing the model's efficiency answers the question: is the model working well with unseen data?

Before going into the evaluation metrics we can use, let's define the results we can get from these metrics.
For these definitions, let's use the example of a model classifying patients as "Sick" or "Not Sick"

  • True Positive (TP) - the number of Sick people who were correctly classified as Sick.
  • True Negative (TN) - the number of Not Sick people who were correctly classified as Not Sick.
  • False Positive (FP) - the number of Not Sick people who were wrongly classified as Sick.
  • False Negative (FN) - the number of Sick people who were wrongly classified as Not Sick.
  • N - the total number of patients
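
In terms of these counts, Accuracy = (TP + TN) / N, Precision = TP / (TP + FP), and Recall = TP / (TP + FN).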

There are many evaluation metrics. However, we will look at 3 popular metrics used for classification models:

  1. Accuracy — How often does the model make correct predictions? i.e., the actual sentiment and the predicted sentiment are the same.

    # Testing accuracy: compare true labels to predictions
    from sklearn.metrics import accuracy_score
    print('Test set\n  Accuracy: {:0.2f}'.format(accuracy_score(Y_test, y_pred)))
    

  2. Confusion Matrix — a table used to visualize the performance of a classification model on a dataset for which the true (target) values are known. A confusion matrix highlights two errors:

    • Type 1 Error - the number of instances that were negative but were wrongly classified as positive. Also called a False Positive (FP).
    • Type 2 Error - the number of instances that were positive but were wrongly classified as negative. Also called a False Negative (FN).

    print("Confusion matrix:")
    CR = confusion_matrix(Y_test, y_pred)
    print(CR)
    
    # plot_confusion_matrix is provided by the mlxtend library
    from mlxtend.plotting import plot_confusion_matrix
    fig, ax = plot_confusion_matrix(conf_mat=CR, figsize=(10, 10),
                                    show_absolute=True,
                                    show_normed=True,
                                    colorbar=True)
    plt.show()
    
  3. AUC (Area Under the ROC Curve) — calculated by plotting the true positive rate against the false positive rate at different classification thresholds.

    • True Positive Rate (Sensitivity) - the proportion of positive samples that are correctly identified as positive.
    • False Positive Rate (1 - Specificity) - the proportion of negative samples that are incorrectly classified as positive.
    • True Negative Rate (Specificity) - the proportion of negative samples that are correctly identified as negative.
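
With the roc_curve and auc imports from the evaluation libraries above, plotting the curve might look like the following sketch (y_score is assumed to be the positive-class probabilities, e.g. model.predict_proba(X_test)[:, 1]):

fpr, tpr, thresholds = roc_curve(Y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = {:0.2f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--')  # the diagonal represents random guessing
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()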

Conclusion

In this article, we talked about the steps you can take to solve a sentiment analysis problem.

I hope you found this article helpful. Leave a comment if you have any questions or would like to discuss this topic further.
